12
OXPath: A Language for Scalable, Memory-efficient Data Extraction from Web Applications Tim Furche, Georg Gottlob, Giovanni Grasso, Christian Schallhart, Andrew Sellers Oxford University Computing Laboratory, Wolfson Building, Parks Road, Oxford OX1 3QD [email protected] ABSTRACT The evolution of the web has outpaced itself: The growing wealth of information and the increasing sophistication of interfaces ne- cessitate automated processing. Web automation and extraction technologies have been overwhelmed by this very growth. To address this trend, we identify four key requirements of web extraction: (1) Interact with sophisticated web application inter- faces, (2) Precisely capture the relevant data for most web extrac- tion tasks, (3) Scale with the number of visited pages, and (4) Read- ily embed into existing web technologies. We introduce OXPath, an extension of XPath for interacting with web applications and for extracting information thus revealed. It addresses all the above requirements. OXPath’s page-at-a-time evaluation guarantees memory use independent of the number of visited pages, yet remains polynomial in time. We validate experi- mentally the theoretical complexity and demonstrate that its evalu- ation is dominated by the page rendering of the underlying browser. Our experiments show that OXPath outperforms existing com- mercial and academic data extraction tools by a wide margin. OX- Path is available under an open source license. Categories and Subject Descriptors H.3.5 [Information Storage and Retrieval]: Online Information Services—Web-based services General Terms Languages, Algorithms Keywords Web extraction, web automation, XPath, AJAX 1. INTRODUCTION The dream that the wealth of information on the web is easily accessible to everyone is at the heart of the current evolution of the web. Due to the web’s rapid growth, humans can no longer find all relevant data without automation. Indeed, many invaluable web services (e.g., Amazon, Facebook, Pandora) already offer lim- ited automation focusing on filtering or recommendation. But in Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present their results at The 37th International Conference on Very Large Data Bases, August 29th - September 3rd 2011, Seattle, Washington. Proceedings of the VLDB Endowment, Vol. 4, No. 11 Copyright 2011 VLDB Endowment 2150-8097/11/08... $ 10.00. many cases, we cannot expect data providers to comply with yet another interface suitable for automatic processing. Neither can we afford to wait another decade for publishers to implement these in- terfaces. Rather, data should be extracted from existing, human- oriented user interfaces. This lessens the burden for providers, yet allows automated processing of everything accessible to human users, not just arbitrary fragments exposed by providers. This ap- proach complements initiatives, such as Linked Open Data, which push providers towards publishing in open, interlinked formats. To enable automation, data accessible to humans through exist- ing web interfaces needs to be transformed into structured data, e.g., a gray span with class source on Google News should be rec- ognized as a news source. These observations lead us to call for a new generation of web extraction tools, which (1) can interact with rich interfaces of (scripted) web applications by simulating user ac- tions, (2) provide extraction capabilities sufficiently expressive and precise to specify the data for extraction, (3) scale well even if the number of relevant web sites is very large, and (4) are embeddable in existing programming environments for servers and clients. Why a new generation? Previous approaches to web extraction [11, 16] do not adequately address web page scripting. Where scripting is addressed [2, 17], the simulation of user actions is nei- ther declarative, nor succinct, but rather relies on action scripts and standalone, heavy-weight extraction interfaces. Though Web au- tomation tools such as [4, 8] are able to deal with scripted web applications, they are tailored to automate single action sequences and prove to be inconvenient and inefficient in large-scale data ex- traction tasks which require multi-way navigation (Section 5). In this paper, we introduce OXPath, a careful, declarative ex- tension of XPath for interacting with web applications to extract information revealed by such interactions. It extends XPath with a few concise extensions, yet addresses all the above desiderata: 1—Interaction. OXPath allows the simulation of user actions to interact with the scripted multi-page interfaces of web applica- tions: ( I ) Actions are specified declaratively through action types and context elements, such as the links to click or the form field to fill. ( II ) In contrast to most previous web extraction and automa- tion tools, actions have a formal semantics (Section 2.2) based on a ( III ) novel multi-page data model for web applications that captures both page navigation and modifications to a page (Section 2.1). 2—Expressive and precise. OXPath inherits the precise se- lection capabilities of XPath (rather than heuristics for element se- lection as in [4]) and extends them: ( I ) OXPath allows selection based on visual features. Specifically, all CSS properties are ex- posed via a new axis. ( II ) OXPath deals with navigation through page sequences, including multi-way navigation, e.g., following multiple links from the same page, and unbounded navigation se- quences, e.g., following next links on a result page until there is no

OXPath: A Language for Scalable, Memory-efficient Data ...€¦ · the web. Due to the web’s rapid growth, humans can no longer find all relevant data without automation. Indeed,

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: OXPath: A Language for Scalable, Memory-efficient Data ...€¦ · the web. Due to the web’s rapid growth, humans can no longer find all relevant data without automation. Indeed,

OXPath: A Language for Scalable, Memory-efficientData Extraction from Web Applications

Tim Furche, Georg Gottlob, Giovanni Grasso, Christian Schallhart, Andrew SellersOxford University Computing Laboratory, Wolfson Building, Parks Road, Oxford OX1 3QD

[email protected]

ABSTRACTThe evolution of the web has outpaced itself: The growing wealthof information and the increasing sophistication of interfaces ne-cessitate automated processing. Web automation and extractiontechnologies have been overwhelmed by this very growth.

To address this trend, we identify four key requirements of webextraction: (1) Interact with sophisticated web application inter-faces, (2) Precisely capture the relevant data for most web extrac-tion tasks, (3) Scale with the number of visited pages, and (4) Read-ily embed into existing web technologies.

We introduce OXPath, an extension of XPath for interacting withweb applications and for extracting information thus revealed. Itaddresses all the above requirements. OXPath’s page-at-a-timeevaluation guarantees memory use independent of the number ofvisited pages, yet remains polynomial in time. We validate experi-mentally the theoretical complexity and demonstrate that its evalu-ation is dominated by the page rendering of the underlying browser.

Our experiments show that OXPath outperforms existing com-mercial and academic data extraction tools by a wide margin. OX-Path is available under an open source license.

Categories and Subject DescriptorsH.3.5 [Information Storage and Retrieval]: Online InformationServices—Web-based services

General TermsLanguages, Algorithms

KeywordsWeb extraction, web automation, XPath, AJAX

1. INTRODUCTIONThe dream that the wealth of information on the web is easily

accessible to everyone is at the heart of the current evolution ofthe web. Due to the web’s rapid growth, humans can no longerfind all relevant data without automation. Indeed, many invaluableweb services (e.g., Amazon, Facebook, Pandora) already offer lim-ited automation focusing on filtering or recommendation. But in

Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page. To copy otherwise, torepublish, to post on servers or to redistribute to lists, requires prior specificpermission and/or a fee. Articles from this volume were invited to presenttheir results at The 37th International Conference on Very Large Data Bases,August 29th - September 3rd 2011, Seattle, Washington.Proceedings of the VLDB Endowment, Vol. 4, No. 11Copyright 2011 VLDB Endowment 2150-8097/11/08... $ 10.00.

many cases, we cannot expect data providers to comply with yetanother interface suitable for automatic processing. Neither can weafford to wait another decade for publishers to implement these in-terfaces. Rather, data should be extracted from existing, human-oriented user interfaces. This lessens the burden for providers,yet allows automated processing of everything accessible to humanusers, not just arbitrary fragments exposed by providers. This ap-proach complements initiatives, such as Linked Open Data, whichpush providers towards publishing in open, interlinked formats.

To enable automation, data accessible to humans through exist-ing web interfaces needs to be transformed into structured data,e.g., a gray span with class source on Google News should be rec-ognized as a news source. These observations lead us to call for anew generation of web extraction tools, which (1) can interact withrich interfaces of (scripted) web applications by simulating user ac-tions, (2) provide extraction capabilities sufficiently expressive andprecise to specify the data for extraction, (3) scale well even if thenumber of relevant web sites is very large, and (4) are embeddablein existing programming environments for servers and clients.

Why a new generation? Previous approaches to web extraction[11, 16] do not adequately address web page scripting. Wherescripting is addressed [2, 17], the simulation of user actions is nei-ther declarative, nor succinct, but rather relies on action scripts andstandalone, heavy-weight extraction interfaces. Though Web au-tomation tools such as [4, 8] are able to deal with scripted webapplications, they are tailored to automate single action sequencesand prove to be inconvenient and inefficient in large-scale data ex-traction tasks which require multi-way navigation (Section 5).

In this paper, we introduce OXPath, a careful, declarative ex-tension of XPath for interacting with web applications to extractinformation revealed by such interactions. It extends XPath with afew concise extensions, yet addresses all the above desiderata:

1—Interaction. OXPath allows the simulation of user actionsto interact with the scripted multi-page interfaces of web applica-tions: (I) Actions are specified declaratively through action typesand context elements, such as the links to click or the form field tofill. (II) In contrast to most previous web extraction and automa-tion tools, actions have a formal semantics (Section 2.2) based on a(III) novel multi-page data model for web applications that capturesboth page navigation and modifications to a page (Section 2.1).

2—Expressive and precise. OXPath inherits the precise se-lection capabilities of XPath (rather than heuristics for element se-lection as in [4]) and extends them: (I) OXPath allows selectionbased on visual features. Specifically, all CSS properties are ex-posed via a new axis. (II) OXPath deals with navigation throughpage sequences, including multi-way navigation, e.g., followingmultiple links from the same page, and unbounded navigation se-quences, e.g., following next links on a result page until there is no

Page 2: OXPath: A Language for Scalable, Memory-efficient Data ...€¦ · the web. Due to the web’s rapid growth, humans can no longer find all relevant data without automation. Indeed,

Hello. Sign in to get personalised recommendations. New Customer? Start here.

Your Amazon.co.uk | Today's Deals | Gift Cards | Gifts & Wish Lists Your Account | Help

Search Books seattle

Books AdvancedSearch

BrowseGenres

New & FutureReleases Bestsellers Paperbacks Audio

BooksBargainBooks

SpecialOffers

Sell YourBooks

New ArrivalsAny Release Date

Last 30 days (147)

Last 90 days (726)

Next 90 days (96)

Department‹ Any Department

BooksTravel & Holiday (2,595)

Reference (5,574)

Art, Architecture &Photography (3,106)

Computing & Internet (2,926)

Children's Books (2,117)

Fiction (3,012)

History (7,666)

Society, Politics &Philosophy (13,480)

Science & Nature (12,056)

Sports, Hobbies &Games (2,130)

Calendars, Diaries, Annuals& More (150)

Romance (293)

Fantasy (204)

Health, Family &Lifestyle (6,082)

Music, Stage &Screen (1,989)

Comics & GraphicNovels (392)

Humour (687)

Business, Finance &Law (7,653)

Scientific, Technical &Medical (6,396)

Home & Garden (1,469)

Biography (3,071)

Horror (55)

Books › "seattle"

Showing 1 - 12 of 43,774 Results Sort by Relevance

1. Seattle: City Guide (Lonely Planet City Guide) by Becky Ohlsen(Paperback - 18 Feb 2011)

Buy new: £13.99 £9.79 16 Used & new from £6.98

Get it by Tuesday, Mar 1 if you order in the next 18 hours and choose expressdelivery.Eligible for FREE Super Saver Delivery.

Excerpt - Front Matter: "BECKY OHLSEN SEATTLE CITY GUIDE"

2. DK Eyewitness Top 10 Travel Guide: Seattle: Free pul-out map and guideby Eric Amrine (Paperback - 3 Aug 2009)

Buy new: £7.99 £5.59 25 Used & new from £2.99

Get it by Tuesday, Mar 1 if you order in the next 18 hours and choose expressdelivery.Eligible for FREE Super Saver Delivery.

(3)

3. The Rough Guide to Seattle (Rough Guide Travel Guides) by Jeff Dickey(Paperback - 29 Jun 2006)

Buy new: £10.99 £7.69 24 Used & new from £0.29

Get it by Tuesday, Mar 1 if you order in the next 18 hours and choose expressdelivery.Eligible for FREE Super Saver Delivery.

Only 3 left in stock - order soon. (2)

Excerpt - page 1: "Introduction to eattle The sparkling waterside city of Seattle is one ofAmerica's most attractive metropolises, laced with parks and lakes, and almost perfectly sitedbetween ... "

4. Seattle (AA CityPack Guides) (AA CityPack Guides) by AA Publishing

Shop All Departments Basket Wish List

Hello. Sign in to get personalised recommendations. New Customer? Start here.

Your Amazon.co.uk | Today's Deals | Gift Cards | Gifts & Wish Lists Your Account | Help

Search All Departments

What Other Customers Are Looking At Right Now

Digital Cameras Bestsellers

Shop All Departments Basket Wish List

BooksBooksKindle eBooksBooks For StudyAudio Books

Music, DVD & GamesMusicMP3 DownloadsMusical Instruments & DJDVD & Blu-rayBlu-rayPC & Video Games

KindleKindle (Wi-Fi)Kindle 3G (Free 3G + Wi-Fi)Free Kindle Reading AppsKindle AccessoriesKindle eBooksKindle NewspapersKindle MagazinesKindle StoreManage Your Kindle

ElectronicsCamera & PhotoTV & Home TheatreAudio, MP3 & AccessoriesSat Nav & Car ElectronicsPhones & AccessoriesPC & Video GamesAll Electronics

Computers & OfficePCs & LaptopsComputer AccessoriesComputer Components

Kindle 3G WirelessReading Device... Amazon£152.00

The Promise [DVD][2011]Claire Foy, ChristianCooke, Peter...DVD£19.99 £11.93

ProtectorLaurel DeweyKindle Edition£0.00

Hello. Sign in to get personalised recommendations. New Customer? Start here.

Your Amazon.co.uk | Today's Deals | Gift Cards | Gifts & Wish Lists Your Account | Help

Search History Seattle

Books AdvancedSearch

BrowseGenres

New & FutureReleases Bestsellers Paperbacks Audio

BooksBargainBooks

SpecialOffers

Sell YourBooks

New ArrivalsAny Release Date

Last 30 days (11)

Last 90 days (60)

Next 90 days (14)

Department‹ Any Department

‹ BooksHistory

Countries & Regions (2,554)

North America (1,409)

Cultural History (2,026)

Social & EconomicHistory (617)

Other HistoricalSubjects (580)

World History (1,089)

Military History (665)

Europe (358)

Britain & Ireland (112)

Essays, Journals, Letters &True Accounts (91)

Religious History (590)

Reference (211)

Maritime History (82)

Archaeology (139)

Ancient History &Civilisation (91)

Political History (417)

FormatAny Format

Hardcover (2,976)

Paperback (4,578)

Kindle Books (18)

LanguageAny Language

English (7,638)

French (1)

German (13)

Spanish (2)

Delivery Option (What's this?)

Any Delivery Option Eligible

FREE Super Saver Delivery

Avg. Customer ReviewAny Avg. Customer Review

& Up (887)

& Up (1,049)

& Up (1,076)

& Up (1,099)

PriceAny Price

Under £5 (1,376)

£5 - £10 (1,269)

£10 - £20 (2,960)

£20 - £50 (1,260)

Over £50 (790)

£ to £

AvailabilityInclude Out of Stock

Listmania!

My favourite children'soutdoor picture books: Alist by Juliet Robertson

"Creative STAR"

Great picture books tomake kids think: A list by

the colour blue

Create a Listmania! list

Search Listmania!

Books › "Seattle" › History

Showing 1 - 12 of 7,655 Results Sort by Relevance

1. Brother Eagle, Sister Sky: A Message from Chief Seattle (Picture Puffin) bySusan Jeffers (Paperback - 4 Nov 1993)

Buy new: £6.99 £5.24 37 Used & new from £0.01

Get it by Tuesday, Mar 1 if you order in the next 49 hours and choose expressdelivery.Eligible for FREE Super Saver Delivery.

(4)

2. Pitchers of Beer: The Story of the Seattle Rainiers by Dan Raley(Hardcover - 25 Aug 2011)

Buy new: £19.99 £17.99

Available for pre-orderEligible for FREE Super Saver Delivery.

3. Seattle in Black and White: The Congress of Racial Equality and the Fightfor Equal Opportunity by Joan Singler, Jean Durning and Bettylou Valentine(Paperback - 25 Mar 2011)

Buy new: £18.99 £17.09

Available for pre-orderEligible for FREE Super Saver Delivery.

4. Seattle by Joel W. Rogers (Hardcover - Aug 2007)

Buy new: £12.32 £11.14 21 Used & new from £6.51

Get it by Tuesday, Mar 1 if you order in the next 49 hours and choose expressdelivery.Eligible for FREE Super Saver Delivery.

Only 1 left in stock - order soon.

5. The World of Chief Seattle: How Can One Sell the Air? by Warren Jefferson(Paperback - Feb 2001)

Buy new: £13.50 £12.15 16 Used & new from £1.20

Get it by Tuesday, Mar 1 if you order in the next 49 hours and choose expressdelivery.Eligible for FREE Super Saver Delivery.

Only 1 left in stock - order soon.

6. Relocating to Seattle and Surrounding Areas: Everything You Need to Knowbefore You Move and after You Get There! by Guy L. Steele(Paperback - July 2000)

8 Used & new from £3.71

7. Emerald City: An Environmental History of Seattle (Lamar Series inWestern History) by Matthew Klingle (Hardcover - 4 Jan 2008)

Buy new: £25.00 £21.25 16 Used & new from £12.27

Usually dispatched within 2 to 4 weeksEligible for FREE Super Saver Delivery.

Excerpt - page 1: "ProloGuE The Fish that Might Save Seattle E very year the salmon comehome. From the deep waters of the North Pacific, millions turn toward ... "

8. The Battle of the Story of the Battle of Seattle by David Solnit and RebeccaSolnit (Paperback - 29 April 2010)

Buy new: £10.00 £9.00 21 Used & new from £3.79

Get it by Tuesday, Mar 1 if you order in the next 49 hours and choose expressdelivery.Eligible for FREE Super Saver Delivery.

Only 1 left in stock - order soon.

9. Five Days That Shook the World: The Battle for Seattle and Beyond byAlexander Cockburn and Jeffrey St. Clair (Paperback - 22 Nov 2000)

Buy new: £12.00 £10.80 31 Used & new from £1.61

Get it by Tuesday, Mar 1 if you order in the next 49 hours and choose expressdelivery.Eligible for FREE Super Saver Delivery.

Only 2 left in stock - order soon.

10. Answering Chief Seattle by Albert Furtwangler and Chief Seattle(Paperback - 15 Jun 1997)

Buy new: £15.99 £13.59 21 Used & new from £3.73

Get it by Tuesday, Mar 1 if you order in the next 46 hours and choose expressdelivery.Eligible for FREE Super Saver Delivery.

Only 1 left in stock - order soon.

Excerpt - page 3: "CHAPTER I The Legendary Tableau CHIEF SEATTLE has become worldfamous in this century for a long and moving speech he made in the 1850s, just before ... "

11. Seattle-Everett Interurban Railway (Images of Rail) by Cheri Ryan andKevin K. Stadler (Paperback - 21 April 2010)

Buy new: £13.58 £12.29 15 Used & new from £9.32

Get it by Tuesday, Mar 1 if you order in the next 47 hours and choose expressdelivery.Eligible for FREE Super Saver Delivery.

Only 2 left in stock - order soon.

12. Selling Seattle: Representing Contemporary Urban America by James Lyons(Hardcover - 1 Jun 2004)

Buy new: £16.99 £16.14 27 Used & new from £0.26

Get it by Tuesday, Mar 1 if you order in the next 49 hours and choose expressdelivery.Eligible for FREE Super Saver Delivery.

Only 1 left in stock - order soon. (2)

« Previous | Page: 1 2 3 ... | Next »

Sponsored Links (What is this?)

1. Things to Do in Seattle Find fun things to do in Seattle on Ouronline searchable calendar.

www.seattlesouthside.com/events

2. Visiting Seattle? Research Seattle Hotels and SeattleAttractions!

tripadvisor.co.uk/Seattle

Search powered by

Your Recent History (What's this?)

After viewing product detail pages or search results, look here to find an easy way to navigate back to pages you are interestedin.

› Visit the Page You Made

Get to Know Us

Careers

Investor Relations

Press Releases

Amazon and Our Planet

Make Money with Us

Sell on Amazon

Associates Programme

Fulfilment by Amazon

› See all

Let Us Help You

Delivery Rates & Policies

Amazon Prime

Returns Are Easy

Manage Your Kindle

Help

Canada China France Germany Italy Japan United States

AbeBooksRare & CollectibleBooks

AudibleDownloadAudio Books

DPReviewDigitalPhotography

IMDbMovies, TV& Celebrities

Javari UKShoes& Handbags

Javari FranceShoes& Handbags

Javari JapanShoes& Handbags

ShopbopDesignerFashion Brands

SoundUnwoundEditable MusicEncyclopedia

Conditions of Use & Sale Privacy Notice © 1996-2011, Amazon.com, Inc. or its affiliates

Shop All Departments Basket Wish List

booktitle:

price:

{click /}

{click /}

SeattleBooks

{click /}

1 2

3

45

+ +

5 star: (4)4 star: (0)3 star: (0)2 star: (0)1 star: (0)

Help other customers find the most helpful reviews Was this review helpful to you?

Report abuse | Permalink

Comment

By "oldgrungewoman" - See all my reviews

Help other customers find the most helpful reviews Was this review helpful to you?

Report abuse | Permalink

Comment

By Amandap - See all my reviews

Help other customers find the most helpful reviews Was this review helpful to you?

Report abuse | Permalink

Comment

Prompts for sign-in

Guidelines

Only search this product's forum

Share your own customer images

Publisher: learn how customers can search inside thisbook.

Hello. Sign in to get personalised recommendations. New Customer? Start here.

Your Amazon.co.uk | Today's Deals | Gift Cards | Gifts & Wish Lists Your Account | Help

Search Books

Books AdvancedSearch

BrowseGenres

New & FutureReleases Bestsellers Paperbacks Audio

BooksBargainBooks

SpecialOffers

Sell YourBooks

Brother Eagle, Sister Sky: A Message fromChief Seattle (Picture Puffin) [Paperback]Susan Jeffers (Author)

(4 customer reviews)

RRP: £6.99Price: £5.24 & this item Delivered FREE in the

UK with Super Saver Delivery. See detailsand conditions

You Save: £1.75 (25%)

In stock.Dispatched from and sold by Amazon.co.uk. Gift-wrap available.

Only 9 left in stock--order soon (more on the way).

Want guaranteed delivery by Tuesday, March 1? Order it inthe next 49 hours and 1 minute, and choose Express delivery atcheckout. See Details

18 new from £1.85 19 used from £0.01

Frequently Bought Together

Price For All Three: £11.73

Customers Who Bought This Item Also Bought

How the Stars Fell intothe Sky: A NavajoLegend (SandpiperHoughton Mifflin Books)by Jerry Oughton

(2)

£3.99

North American Indians(Pictureback Series)by Marie Gorsline

(1)

£2.50

More Than Moccasins:A Kid's Activity Guideto Traditional NorthAmerican Indian Lifeby Laurie Carlson

(1)

£7.72

Hiawatha (PicturePuffin)by Henry Longfellow

(3)

£5.24

They Dance in the Sky:Native American StarMythsby Jean Guard Monroe

(1)

£5.00

› Explore similar items

Product detailsPaperback: 32 pagesPublisher: Puffin; New Ed edition (4 Nov 1993)Language EnglishISBN-10: 014054514XISBN-13: 978-0140545142Product Dimensions: 29.8 x 23.6 x 0.6 cmAverage Customer Review: (4 customer reviews)

Amazon Bestsellers Rank: 83,281 in Books (See Top 100 in Books)#58 in Books > History > North America > Native Americans

Would you like to update product info or give feedback on images?

More About the Author

Discover books, learn about writers, and more.

› Visit Amazon's Chief Seattle Page

Product Description

Product DescriptionNearly 150 years ago, Chief Seattle, a respected and peaceful leader ofone of the Northwest Indian Nations, delivered a message tothegovernment in Washington who wanted to buy his people's land. Hebelieved that all life on earth, and the earth itself, is sacred.Amoving and compelling plea for an end to man's destruction of nature.

Tags Customers Associate with This Product (What's this?)Click on a tag to find related items, discussions, and people.

bestsellers (1)

Your tags: Add your first tag

Search Products Tagged with

What Do Customers Ultimately Buy After Viewing This Item?

85% buy the item featured on this page:Brother Eagle, Sister Sky: A Message from Chief Seattle (Picture Puffin) by Susan Jeffers Paperback (4)

£5.24

4% buyNorth American Indians (Pictureback Series) by Marie Gorsline Paperback (1)

£2.50

4% buyHow the Stars Fell into the Sky: A Navajo Legend (Sandpiper Houghton Mifflin Books) by Jerry Oughton Paperback (2)

£3.99

4% buyHiawatha (Picture Puffin) by Henry Longfellow Paperback (3)

£5.24

› Explore similar items

Customer Reviews4 Reviews

Average Customer Review

(4 customer reviews)

Share your thoughts with other customers:

Most Helpful Customer Reviews

10 of 10 people found the following review helpful: A beautiul and thought-provoking book for all ages,

21 Oct 1999By A CustomerThis review is from: Brother Eagle, Sister Sky: A Message from Chief Seattle (PicturePuffin) (Paperback)

I bought this book for my young and spiritually aware 4 year oldthinking that I would read it with her when she is a little older. Theillustrations are delightful and the text is so easy to understand that I'vebeen reading it to her now. This is a special book with a vital messageto us all. The sooner our children understand the message it carries, thebetter chance they have of improving our world. He says "The earthdoes not belong to us. We belong to the earth." I still have a lump inmy throat and a tear in my eye when I read it. If you can't buy it for achild, buy it for yourself - it's beautiful.

6 of 6 people found the following review helpful: A beautiful teaching resource, 29 July 2004

This review is from: Brother Eagle, Sister Sky: A Message from Chief Seattle (PicturePuffin) (Paperback)

I used this book recently with a year three class. It followed an in-schooldrama performance of a story relating to Irish migration, and thesubsequent frontier movement of the pioneers. The children were totallyfascinated by the story and were able to take on board the fact, that;although based on truth and actual events, the words said to be spokenby Chief Seattle are open to question.The paintings are beautiful, as is the poetic language. I intend to usethis text again. It has potential across the curriculum, providing excellentLiteracy opportunities as well as starting points for other areas such as;PSHE, Geography, History, RE, Music and Art.This is a valuable resource. It is unfortunate that we will never know an'absolute' version of Chief Seattle's speech but, Jeffers' simpleinterpretation makes his intention widely accessible, and can only serveto invoke greater interest in the tragedy of the Northwest Indian tribes.

Caring for the natural world, 25 Feb 2011

This review is from: Brother Eagle, Sister Sky: A Message from Chief Seattle (PicturePuffin) (Paperback)

Well written and super illustrations. Clearly conveys message that weneed to look after the natural world - fitted in superbly into RE teachingsyllabus and cost very resaonable.

Share your thoughts with other customers:

› See all 4 customer reviews...

Most Recent Customer Reviews

Beautiful book but be aware ofthe problems.Without doubt this is a beautifully producedbook and is ideal in terms of consciousness-raising. So, on face-value terms, a valuableaddition to a child's library. Read morePublished on 5 Dec 1999 by [email protected]

Search Customer Reviews

Only search this product's reviews

› See all 4 customer reviews...

Customers Also Bought Items By

Customer DiscussionsThis product's forum

Discussion Replies Latest Post

No discussions yet

Ask questions, Share opinions, Gain insight

Start a new discussionTopic:

First post:

Receive e-mail when new posts are made

Active discussions in related forumsDiscussion Replies Latest Post

childrens booksNon-stereotypical gender roles 4 4 hours ago

childrens booksGirl aged 7, nearly 8, reading age 14. Anyideas?

108 9 hours ago

childrens booksRecommendations for 2-year-old please! 33 18 hours ago

childrens bookspuffing billy the toy town train 1 18 hours ago

childrens booksCan anyone suggest some books for anadvanced 10 year old boy

65 19 hours ago

childrens booksFiction for 5 to 7 year olds that cover roadsafety

2 19 hours ago

childrens booksJamila Gavin book for a7 yr old 6 1 day ago

childrens booksKids on Kindles? 4 1 day ago

Search Customer Discussions

Related forums

childrens books (583 discussions)

Listmania!

Great reads for 10 and 11 year olds: A list by April Wallis"scary_aunty"

Fab picture books for children of all ages: A list by April Wallis"scary_aunty"

Books for Forest School Leaders: A list by Louise Ambrose

Create a Listmania! list

Search Listmania!

Look for similar items by categoryBooks > Children's Books > FictionBooks > History > North America > Native Americans

Look for similar items by subject Children's Books History / Native American Juvenile Fiction Children's Fiction Nature & the Natural World - General People & Places - United States - Native American Children's & young adult fiction & true stories General fiction (Children's/YA) History & the past: general interest (Children's/YA) North America American Indians Land Tenure

Find products matching ALL checked subjects i.e., each product must be in subject 1 AND subject 2 AND ...

FeedbackWould you like to update product info or give feedback on images?I am the Author, and I want to comment on my book.I am the Publisher, and I want to comment on this book.

Amazon.co.uk Privacy Statement Amazon.co.uk Delivery Information Amazon.co.uk Returns & Exchanges

Get to Know Us

Careers

Investor Relations

Press Releases

Amazon and Our Planet

Make Money with Us

Sell on Amazon

Associates Programme

Fulfilment by Amazon

› See all

Let Us Help You

Delivery Rates & Policies

Amazon Prime

Returns Are Easy

Manage Your Kindle

Help

Canada China France Germany Italy Japan United States

AbeBooksRare & CollectibleBooks

AudibleDownloadAudio Books

DPReviewDigitalPhotography

IMDbMovies, TV& Celebrities

Javari UKShoes& Handbags

Javari FranceShoes& Handbags

Javari JapanShoes& Handbags

ShopbopDesignerFashion Brands

SoundUnwoundEditable MusicEncyclopedia

Conditions of Use & Sale Privacy Notice © 1996-2011, Amazon.com, Inc. or its affiliates

Shop All Departments Basket Wish List

Quantity: 1

orSign in to turn on 1-Click ordering.

More Buying Choices

37 used & new from £0.01

Have one to sell?

Share

Tell the Publisher! I’d like to read this book on Kindle

Don t have a Kindle? Get yourKindle here, or download a FREEKindle Reading App.

Formats Amazon Price New from Used from

Library Binding -- £13.47 £19.16

Paperback £5.24 £1.85 £0.01

Unknown Binding -- -- £1.32

See our great savings on 1000s of books in our Seasonal Offers.

This item: Brother Eagle, Sister Sky: A Message from Chief Seattle (Picture Puffin) by Susan JeffersPaperback £5.24How the Stars Fell into the Sky: A Navajo Legend (Sandpiper Houghton Mifflin Books) by Jerry OughtonPaperback £3.99North American Indians (Pictureback Series) by Marie Gorsline Paperback £2.50

Your Recent History (What's this?)

After viewing product detail pages or search results, look here to find an easy way to navigate back to pages you are interestedin.

› Visit the Page You Made

publisher:

doc("amazon.co.uk")//field()[@title=’Search in’]/{"Books"}Ê/following::field()[@title=’Search for’]/{"Seattle" /}Ë//field()[@alt=’Go’]/{click /}Ì//a[*.refinementLink[.#=’History’]]/{click /}Í//*.result:<book>[.//a.title:<title=(.)>/{click /}Î//b[.#=’Publisher’]/following-sibling::*:<publisher=(.)>][.//span.price[1]:<price=(.)>]

Figure 1: Finding an OXPath through Amazon.

further. (III) OXPath enables the identification of data for extrac-tion, which can be assembled into (hierarchical) records, regard-less of its original structure. (IV) Based on the formal semanticsof OXPath (Section 2.2), we show that its extensions considerablyincrease the language’s expressiveness (Section 2.3).

3—Scale. OXPath scales well both in time and memory: (I) Weshow that OXPath’s memory requirements are independent of thenumber of pages visited (Section 4). To the best of our knowl-edge, OXPath is the first web extraction tool with such a guarantee,as confirmed by a comparison with five commercial and academicweb extraction tools. (II) We show that the combined complexityof evaluating OXPath remains polynomial (Section 2.3) and is onlyslightly higher than that of XPath (Section 4). (III) We also showthat OXPath is highly parallelizable (Section 2.3). (IV) We ver-ify these theoretical results in an extensive experimental evaluation(Section 5), showing that OXPath outperforms existing extractiontools on large scale experiments by at least one order of magnitude.

4—Embeddable, standard API. OXPath is designed to in-tegrate with other technologies, e.g., Java, XQuery, or Javascript.Following the spirit of XPath, we provide an API and host languageto facilitate OXPath’s interoperation with other systems (Section 3).

Bonus: Open Source. On http://diadem-project.info/

oxpath, we provide under the new BSD license the OXPath pro-totype and API along with some illustrative examples.

1.1 Scenario: History Books on SeattleTo extract history books on Seattle currently offered on amazon.

co.uk, a user has to perform the following sequence of actionsto retrieve the page listing these books (see Figure 1): (1) Select“Books” from the “Search in” select box, (2) enter “Seattle” intothe “Search for” text field, and (3) press the “Go” button to startthe search. On the returned page, (4) refine the search to only “His-tory” books. Figure 1 shows an OXPath expression that realizesthis navigation in Lines 1–4 (the red numbers in Figure 1 refer tothe user actions above). To select the two input fields, we use OX-Path’s field() node-test (matching only visible form elements) and

each node’s title attribute (@title). A contextual action (enclosed in{}) selects “Books” from the select box and continues the naviga-tion from that field. The other actions are absolute (with an added/ before the closing brace) where navigation continues at the rootof the page retrieved by the action. To select the “History” link,we adopt the . notation from CSS for selecting elements with aclass attribute refinementLink and use OXPath’s #= shorthand forXPath’s contains() function for identifying the “History” text.

Once on that page, we want to extract the book title, price, andpublisher. Lines 5–7 of Figure 1 show how to achieve this extrac-tion: We identify the record boundaries as the element with classresult and instruct OXPath to label these records with book usingthe record extraction marker :<book>. From there, we navigate tothe contained title links, extract their value as a title attribute, andclick on the link to get to the page for the individual book, where wefind and extract the publisher. Finally, we extract the price. Notethat it is on the previous page, but the user does not need to care ofthe order pages are visited in or if they need to be buffered. OX-Path buffers pages when necessary, yet guarantees that the numberof buffered pages is independent of the number of visited pages.

In Appendix A and on diadem-project.info/oxpath we showfurther scenarios for using OXPath: searching for a flight on kayak.

co.uk (Figure 9) where interaction with a visual, scripted interfaceis required to find non-stop flights to Seattle, finding relevant aca-demic papers and their citations from Google Scholar (Figure 11),and extracting stock quotes from Yahoo Finance (Figure 10).

1.2 OXPath by ExampleOXPath extends XPath with four concepts: Actions to navigate

the interface of web applications, means for interacting with highlyvisual pages, extraction markers to specify data to extract, and theKleene star for extraction from a set of pages with unknown extent.

Actions. For simulating user actions such as clicks or mouse-overs, OXPath introduces contextual, as in {click}, and absoluteaction steps with a trailing slash, as in {click /}. Since actionsmay modify or replace the DOM, we assume that they always re-turn a new DOM. Absolute actions return DOM roots, contextualactions return the nodes in the new DOM matched by the action-free prefix (Section 2.2) of the performed action, which is obtainedfrom the segment starting at the previous absolute action by drop-ping all intermediate contextual actions and extraction markers.

Style Axis and Visible Field Access. For lightweight vi-sual navigation we expose the computed style of rendered HTMLpages with a new axis for accessing CSS DOM node properties anda new node test for selecting only visible form fields. The style

axis navigates the actual CSS properties of the DOM style object,e.g., it is possible to select nodes by their (rendered) color or fontsize. To ease fields navigation, we introduce the node-test field(),that relies on the style axis to access the computed CSS style toexclude not visible fields, e.g., /descendant::field()[1] selects thefirst visible field in document order.

Extraction Marker. In OXPath, we introduce a new kind ofqualifier, the extraction marker, to identify nodes as representativesfor records and to form attributes from extracted data. For example,doc("news.google.com")//div[@class~="story"]:<story>

[.//h2:<title=string(.)>][.//span[style::color="#767676"]:<source=string(.)>]

extracts a story element for each current story on Google News,along with its title and sources (as strings), producing:

<story><title >Tax cuts ...</title><source>Washington Post</source><source>Wall Street Journal</source> ... </story>

Page 3: OXPath: A Language for Scalable, Memory-efficient Data ...€¦ · the web. Due to the web’s rapid growth, humans can no longer find all relevant data without automation. Indeed,

The nesting in the result mirrors the structure of the OXPath ex-pression: extraction markers in a predicate (title, source) repre-sent attributes to the last marker outside the predicate (story).

Kleene Star. Finally, we add the Kleene star, as in [12]. Forexample, the following expression queries Google for “Oxford”,traverses all accessible result pages and extracts all links.

doc("google.com")/descendant::field()[1]/{"Oxford"}/following::field()[1]/{click /}

/( /descendant::a:<Link=(@href)>[.#="Next"]/{click /} )*

To limit the range of the Kleene star, one can specify upper andlower bounds on the multiplicity, e.g., (...)*{3,8}.

1.3 Related WorkWe briefly summarise the state-of-the art in web automation and

extraction and contrast it to OXPath. In particular, we consider(1) filling and submitting a web form, (2) multi-way navigation,e.g., following multiple links from the same page, (3) memorymanagement for large scale extraction (4) and availability of anAPI as concrete criteria for the requirements discussed above. Therelevant tools can be divided into three classes:

Full-fledged, stand-alone extraction tools (as Lixto [2]) are atleast as expressive as OXPath but are outclassed by OXPath both inextraction speed and memory by a wide margin (Section 5). Lixto,Visual Web Ripper (visualwebripper.com), and Web Content Ex-tractor (newprosoft.com/web-content-extractor.htm) are mov-ing towards interactive wrapper generator frameworks, recordinguser actions in a browser and replaying these actions for extract-ing data. None of these systems addresses memory managementand our experimental evaluation (Section 5) demonstrates that suchsystems indeed require memory linear in the number of accessedpages—in contrast to OXPath.

Extraction languages [13, 1, 14, 11, 16, 15] use a declara-tive approach, much like OXPath; however, they often do not ad-equately facilitate deep web interaction such as form submissionsmainly due to their age. Also, they do not provide native constructsfor page navigation, apart from retrieving a page given a URI.The BODED extraction language (in BODE [17] system) is ableto deal with modern web applications. Similar to OXPath, it isbrowser-based but its scripts are imperative and XML based; mem-ory management is not considered and it employs browser replica-tion for multi-way navigation that imposes a performance penalty,making BODE unsuitable for large-scale data extraction.In the evaluation, we compare also with Web Harvest(web-harvest.sourceforge.net), a recent, open source exampleof an extraction language. Extraction tasks are specified as imper-ative scripts (in XML files). It does not deal with web applications(form filling) and does not give access to the rendered page, butrather to a cleaned XML view of HTML documents.

Web automation tools are mainly focused on single navigationsequences through web applications, but do not consider large-scale web extraction and mostly do not support multi-way navi-gation. Coscripter [8] and iMacros (iopus.com/iMacros) are ex-amples of such tools: They do not allow multi-way navigation dueto limited support for iterative and conditional programming. Veg-emite [9] is a CoScripter extension that facilitates some extractioncapability such as population of tables. However, as its authorsnote, this interactivity comes at a price to performance as the samepage may be reloaded many times. Also, page state is not preservedand thus some web applications may not behave as expected.The same applies to Chickenfoot [4], a language for web automa-tion that allows users to program scripts that run in Firefox. Itenables interaction with forms as well as loading and navigating

〈expr〉 ::= 〈funct〉 ‘(’ ( 〈expr〉 ( ‘,’ 〈expr〉)* )? ‘)’| ‘(’ 〈expr〉 ‘)’ | 〈xpath-value〉 | 〈expr〉 〈binop〉 〈expr〉| 〈xpath-unop〉 〈expr〉 | 〈path〉 ( ‘|’ 〈path〉 )*

〈path〉 ::= 〈estep〉 ( ‘/’ 〈path〉)?〈estep〉 ::= 〈step〉 | 〈action〉 | 〈kleene〉〈kleene〉 ::= ‘(’ 〈path〉 ‘)∗’ ( ‘{’ 〈number〉 ‘,’ 〈number〉 ‘}’ )?〈step〉 ::= 〈axis〉 ‘::’ 〈node-test〉 | 〈step〉 〈qualifier〉 | 〈step〉 〈marker〉〈funct〉 ::= 〈xpath-funct〉 | ‘doc’〈axis〉 ::= 〈xpath-axis〉 | ‘style’〈node-test〉 ::= (〈xpath-nt〉 | ‘ f ield()’ ) ( ( ‘.’ | ‘#’ ) 〈ident〉 )?〈qualifier〉 ::= ‘[’ 〈expr〉 ‘]’ | ‘[?’ 〈expr〉 ‘]’〈marker〉 ::= ‘:<’ 〈name〉 ( ‘=’ 〈expr〉 )? ‘>’〈action〉 ::= ‘{’ (〈ident〉 | 〈xpath-value〉) ‘/’ ? ‘}’〈binop〉 ::= 〈xpath-binop〉 | ‘∼=’ | ‘# =’

Figure 2: OXPath Syntax.

pages. Multi-way navigation is feasible, but only by explicitly us-ing “back” instructions which command the browser to return tothe previous page. Thus, page buffering is unnecessary, but pagestate is not preserved and pages need to be rendered again.

In contrast to supervised tools, unsupervised web extraction tools(see survey in [5]) require little or no input from the user. They fo-cus on automated analysis rather than extraction and are not directlycomparable to OXPath. Furthermore, publicly available systems(such as [6]) do not deal with scripted web applications.

2. LANGUAGEThe syntax of OXPath is defined in Figure 2 (XPath literals as in

[7, 3]). We impose additional restrictions on OXPath expressions toavoid interactions of functions and sorting operators with OXPath’sactions and extraction expressions (see Appendix B).

We choose XPath rather than XQuery as underlying language,since XPath provides sufficient expressiveness for most tasks if ex-tended with data extraction features not found in either XPath orXQuery, yet allows strong guarantees on time and memory bounds.

2.1 Data ModelOXPath’s data model extends the data model of XPath to multi-

ple pages (i.e., HTML documents), interconnected by actions. OX-Path expressions are evaluated on relational structures, called pagetrees, with schema

((Rν )ν∈NodeSets,Rchild,Rnext-sibl,(Rα )α∈Acts

)where (Rν )ν∈NodeSets is the set of (unary) node-set relations (typeand label relations), Rchild the parent-child and Rnext-sibl the directsibling relation. The remaining XPath axes (such as descendant)are derived from the basic relations as usual [3]. We add a set ofbinary relations (Rα )α∈Acts, one for each OXPath action. A tuple(x,y) ∈ Rα indicates that the action α triggered on node x yieldsthe page rooted at y. (Rα )α∈Acts together with Rchild form a tree:Each page has a unique parent node connected by a single actionedge. For instance, if two links point to the same URI this yieldstwo different pages in the page tree.

An OXPath expression E returns an unordered XPath tree, i.e., aset of tuples over the schema ((Rν )ν∈NodeSets,Rchild). This allowsthe extraction of (possibly nested) records with multiple values. Werefer to nodes in this tree (in the page tree) as output (input) nodes.

Figure 3 illustrates both the data model and the result of an OX-Path expression. The page tree consists of the nodes of the consid-ered pages, connected by child, next-sibl, and (labeled) action edges(in the figure we use only {click/}). Each distinct path of actionsand nodes leads to a distinct page. OXPath traverses action edgesonly in the direction of the edge (no reverse navigation), and onlydirectly (no descendant over action edges). The part of the page

Page 4: OXPath: A Language for Scalable, Memory-efficient Data ...€¦ · the web. Due to the web’s rapid growth, humans can no longer find all relevant data without automation. Indeed,

results

… …

Rchild Rnext-sibl R{click /}

DOM node extracted record extracted value page

accessed

Figure 3: OXPath page tree.

tree (the pages and nodes) accessed by a given OXPath expressionis called the accessed page tree (accessed pages and nodes).

Figure 3 also shows the result of an OXPath expression: Fora node matching a record extraction markers (such as :<R>), a newnode in the output tree is created and labeled R. For nodes matchingvalue extraction markers (such as :<R=string(.)>), two new nodesin the output tree are created, one is labeled R, the other is a textnode containing the extracted value and is created as a child of thefirst node. The structure of the output tree reflects the structure ofthe extraction markers in the OXPath expression (but not that ofthe input tree): For instance //a:<R>[parent::b:<S>] returns a treewhere the output nodes for S are children of the output nodes for R.

2.2 SemanticsThe semantics of OXPath is defined in terms of its extraction

semantics (JexprKE (c)). The extraction semantics specify the con-struction of the result tree and uses OXPath’s value semantics(JexprKV (c)) for matching nodes from the page tree. The value se-mantics loosely follows the XPath semantics in [3]: Parametrizedby an OXPath expression expr, it takes a node n, supplemented intoa context tuple c, and computes the value reached via expr. Thisvalue is either a set of context tuples, a Boolean, integer, or stringvalue—as in XPath.

Each context tuple—and here the OXPath semantics differs fromXPath—c = 〈n, p, l〉 consists of an input node n, the parent outputnode p, and the last sibling output node l, also accessed as c.n,c.p, c.l. We maintain both the parent and sibling match to organizethe extracted nodes hierarchically: Subsequent extraction matchesoutside predicates yield nodes that are descendants of p and thussiblings of l. On entering a predicate, l replaces p as parent outputnode such that extraction matches yield nodes that are descendantsof l (nested records) rather than further siblings.

We start the evaluation of an OXPath expression with the ini-tial context node c = 〈⊥,〈>,results〉 ,〈>,results〉〉, where ⊥is an arbitrary context node and 〈>,results〉 is the root of allsubsequent extraction matches. The context node is arbitrary as anexpression always starts with doc(uri) which returns the root of thepage with the given uri for any context node.

The complete semantics is given in Appendix C. The extractionsemantics is fairly straightforward. Thus, we focus here on the

V1 J pathKV (c) = J pathKN (c)

N1 Jestep/pathKN (c) = {c′′ | c′ ∈ JestepKN (c)∧ c′′ ∈ JpathKN (c′)}

N2 Jaxis::nodesKN (c) = {〈n′,c.p,c.l〉 | Raxis(c.n,n′)∧n′ ∈ Rnodes}N3 Jstep[q]KN (c) = {c′ ∈ JstepKN (c) | JqKB (〈c′.n,c′.l,c′.l〉)}N4 Jstep±[qp]KN (c) = {c′ ∈ Jstep± KN (c) |C = Jstep± KN (c)∧

J REWRITE±(qp,C,c′)KB (〈c′.n,c′.l,c′.l〉)}N5 J{action /}KN (c) = {〈n′,c.p,c.l〉} with n′ such that Raction(c.n,n′)N6 J{action}KN (c) = J AFP(action,c.n)KN (J{action /}KN (c))

N7 Jstep : 〈M[= v]〉KN (c) = {〈c′.n,c′.p,OUT(c′.n,M)〉 | ∃c′ ∈ JstepKN (c)}N8 J(path)∗KN (c) = {cr | ∃r ≥ 0 ∀0≤ s≤ r :

cs+1 ∈ JpathKN (cs)∧ c0 = c}N9 J(path)∗{v,w}KN (c) = {cr | ∃v≤ r ≤ w ∀0≤ s≤ r :

cs+1 ∈ JpathKN (cs)∧ c0 = c}

Table 1: Value Semantics of OXPath.value semantics. For an OXPath expression that is also an XPathexpression, it computes the exact same result as standard XPath.In Table 1 we highlight only the major differences to the XPathsemantics necessitated by OXPath extensions.

J pathKV delegates the evaluation of a path to J pathKN (V1),which handles expressions computing a node sets. The rules forother values are omitted here (see Appendix C). Paths are evaluatedusing J pathKN. A path is composed into OXPath steps (N1) andeach type of step is treated in N2–N9:

N2–N4: Axes, node-tests, and predicates. OXPath axes, node-tests, and predicates are treated as in standard XPath. When en-tering predicates, the last sibling output node is copied to the par-ent output node. Expressions in predicates are cast to Booleans bymeans of Jexpr KB as in XPath (see [3] and Appendix C). For po-sitional predicates (N4) we replace, using REWRITE±, in qp eachnon-nested occurrence of last() with |C| and of position() withthe position of c′ within C. The order depends on the axis in step.

N5–N6: Actions. Actions map the context node c.n to a nodein a different page. Absolute actions (N5) map c.n to the root n′

of some other page with (Rα )α∈Acts (by the data model there canbe only a single such node). For contextual actions in rule N6,we first evaluate the absolute action using N5, but then attempt tomove to a node in the new page that corresponds closely to c.n inthat it is selected by the same OXPath expression used to select c.nin the old page. The AFP of action and c.n in an OXPath expres-sion E returns that OXPath expression:1 Let base be the subex-pression of E between action and its last preceding absolute ac-tion, stripped of all contextual actions and extraction markers. ThenAFP(action,c.n) = (base)[i] where i is the positon of c in documentorder among all nodes in the current page matching base. In thecurrent page, this expression uniquely identifies c.n. When evalu-ated from the root returned by the absolute action on c.n, it selectsa node reached by the same path and in the same relative position.

N7: Extraction markers. For extraction markers, the context setof the corresponding step is modified by replacing the last siblingmarker with OUT(c′.n,M) where OUT is an injective function thatmaps an input node and extraction marker to an output node.

N8–N9: Kleene star. For unbounded and bounded Kleene stars,the Kleene-repeated path is matched multiple times; if necessary,bounds on the number of repetitions are enforced.

2.3 ComplexityConsidering the complexity of OXPath, we note that expressions

containing Kleene star repeated actions may require access to anunbounded number of pages. In particular, when we evaluate such

1For sake of brevity, we omit E where clear from context.

Page 5: OXPath: A Language for Scalable, Memory-efficient Data ...€¦ · the web. Due to the web’s rapid growth, humans can no longer find all relevant data without automation. Indeed,

Embe

ddi

ng

OX

Pat

h En

gine

Web

Acc

ess

Web Access API

HtmlUnit

PAAT Algorithm

Basic OXPath Evaluator

OXPath API JAXP XPath API

Host Environment (Java, XQuery, CLI)

XPath Variables for Form Filling

DB

Visual OXPath

OXPath Parser & Rewriting

Browser XPathNod

e M

anag

er

SWT Mozilla

Page Buffer

Page Modification Man.

Figure 4: OXPath System Architecture.an expression, we do not know whether the evaluation terminatesand how many pages are accessed during evaluation. Thus, whenwe discuss the complexity of evaluating OXPath, we only considerexpressions whose evaluations terminate and consider all accessedpages as input. Furthermore, we assume that traversing an actiontakes constant time. There are few web pages where actions (trig-gering a Javascript execution) are not executed quickly.

THEOREM 1 (COMPLEXITY). OXPath query evaluation with-out string concatenation and multiplication is in NLOGSPACE fordata complexity. OXPath query evaluation is PTIME-complete forcombined complexity.

We give the proof in Appendix D along with a discussion of otherlanguage properties of OXPath, viz. that no two nodes from differ-ent pages need to be sorted and that no output buffer is required.

3. SYSTEM ARCHITECTUREOXPath uses a three layer architecture as shown in Figure 4, pro-

moting modularity and allowing substitution of web browsers andhost environments:

(1) The Web Access Layer enables programmatic access to apage’s DOM as rendered by a modern web browser. This is re-quired to evaluate OXPath expressions which interact with webapplications or access computed CSS style information. The weblayer is implemented as a facade which promotes interchangeabil-ity of the underlying browser engine (Firefox and HTMLUnit).

(2) The Engine Layer consists of two main components: theOXPath parser/rewriter and the PAAT algorithm (Section 4). Ba-sic OXPath steps, i.e., subexpressions without actions, extractionmarkers and Kleene stars, are directly handled by the browser’sXPath engine. The PAAT algorithm buffers pages as well as de-tects and manages page modifications. When a page is to be re-moved from the buffer, the corresponding nodes are also droppedby the node manager. The buffer manager retrieves new pages and,when necessary, makes page copies before modification by actions.

(3) The Embedding Layer facilitates the integration of OXPathwithin other systems and provides a host environment to instantiateOXPath expressions. The host environment provides variable bind-ings from databases, files, or even other OXPath expressions foruse within OXPath. To facilitate OXPath integration, we slightlyextend the JAXP API to provide an output stream for extracted data.

Though OXPath can be used in any number of host languagessuch as Java, XSLT, or Ruby, we designed a lightweight host lan-guage for large-scale data extraction. It is a SQL-like relationalquery language with OXPath subqueries, grouping, and aggrega-tion. We separate these tasks, as well as the provision of variablebindings, from the core language to preserve the declarative natureof OXPath and to guarantee efficient evaluation of the core features.

OXPath is complemented by a visual user interface, a Firefoxadd-on that records mouse clicks, keystrokes, and other actions in

1 Function eval−(Expr,c,prot):

2 if Expr≡ ε then return {c};3 if Lookup[Expr,c] 6= null then return Lookup[Expr,c];4 Expr≡ et where e matches one of the following cases;

5 Result← /0;6 if e≡ axis :: nodes then7 Ctx←{〈n′,c.p,c.l〉 | Raxis(c.n,n′) and n′ ∈ Rnodes};8 for c′ ∈ Ctx do9 prot′← isLastIn(c′,Ctx)?prot : true;

10 Result← Result∪eval−(t,c′,prot′);

11 else if e≡ [q] then12 if eval(q,〈c.n,c.l,c.l〉 ,(t ≡ ε)?prot : true) 6= /0 then13 Result← eval−(t,c,prot)

14 else if e≡ :〈M〉 or e≡ :〈M = v〉 then15 output RM(OUT(c.n,M)), Rchild(OUT(c.n,M),c.p);16 if e≡ :〈M = v〉 then17 val← eval(v,c,(t ≡ ε)?prot : true);18 output Rval(c′), Rchild(c′,OUT(c.n,M)) | c′ new output node;

19 Result← eval−(t,〈c.n,c.p,OUT(c.n,M)〉 ,prot);

20 Lookup[Expr,c]← Result;21 return Result;

Algorithm 1: PAAT Simple Evaluation.

sequence to construct an equivalent OXPath expression. It also al-lows the selection and annotation of nodes used to construct a gen-eralised extraction expression. We are actively improving the visualinterface and developing a visual debugger for OXPath.

4. PAGE-AT-A-TIME EVALUATIONThe evaluation algorithm of OXPath is dominated by two issues:

(1) How to minimize the number of page buffers without reloadingpages, and (2) how to guarantee an efficient, polynomial time evalu-ation of individual pages. To avoid reloading pages, we need to visiteach page exactly once, and hence we have to retain each page untilall its descendants have been processed. To address (1), our page-at-a-time algorithm traverses the page tree in a depth-first man-ner without retaining information on formerly visited pages. How-ever, a naive depth-first evaluation within individual pages wouldcause a worst-case exponential runtime and violate (2), necessi-tating memoization of intermediate results. As a solution to bothrequirements, we employ two mutually recursive evaluation proce-dures eval−(Expr,c,prot) and eval(Expr,Ctx,prot), shown in Al-gorithms 1 and 2. Both take an expression Expr, either a single con-text tuple c or a set Ctx of such tuples referring to the same page,and a Boolean flag prot, indicating whether the page referred toby the context is needed after the current invocation—and is there-fore protected. We evaluate a general OXPath expression Expr witheval(Expr,{〈⊥,〈>,results〉 ,〈>,results〉〉}, false).

4.1 PAAT Simple EvaluationThe first procedure, eval− (Algorithm 1), applies memoization

to evaluate simple expressions which may contain actions or Kleenestars only in predicates but not in the main path. For brevity, weonly show the evaluation of the most important expressions, i.e.,axis navigation with node-tests, predicates, and extraction markers.

In line 2 we check for empty expressions as base case and returnthe input context tuple c as result {c} of the evaluation. Next, inline 3, we check whether the result of applying Expr to c has beenmemoized, and if so, return this result. Otherwise, we evaluate Exprand store the result at the end of the algorithm in line 20.

The main part of the algorithm is organized in a depth-first man-ner: The algorithm splits the input expression Expr into prefix eand remainder t (line 4) to evaluate e directly and t recursively.

Page 6: OXPath: A Language for Scalable, Memory-efficient Data ...€¦ · the web. Due to the web’s rapid growth, humans can no longer find all relevant data without automation. Indeed,

1 Function eval(Expr,Ctx,prot):

2 if Expr is simple then3 return

⋃c∈Ctx eval−(Expr,c,isLastIn(c,Ctx)?prot : true);

4 Expr≡ het where h is simple, e matches one of the following;5 Ctx←

⋃c∈Ctx eval−(h,c, true);

6 if Ctx = /0 then return /0;

7 Result← /0;8 if e≡ {action/}∨{action} then9 for c ∈ Ctx do

10 prot′← isLastIn(c,Ctx)?prot : true;11 c′← getPage(c,action,prot′);12 if e≡ {action} then13 if isPageUnmodified() then c′← c;14 else c′← eval(AFP(action,c.n),c′, true);15 Result← Result∪eval(t,{c′}, false);16 FreeMem(pageof(c′));

17 else if e≡ (path)∗{v,w} then18 if containsAction(path) then19 if w = 0 then Result← Result∪eval(t,Ctx,prot);20 else21 if v≤ 0 then Result← Result∪eval(t,Ctx, true);22 if w > 1 then Expr′← path path∗{v−1,w−1} t;23 else Expr′← path t;24 Result← Result∪eval(Expr′,Ctx,prot);

25 else26 Ctx′← Ctx;27 for i← 0 to w−1 do28 Ctx← eval(path,Ctx,(t ≡ ε ∧ i = w−1)?prot : true);29 if i≥ v then Ctx← Ctx\Ctx′; Ctx′← Ctx′ ∪Ctx;30 if Ctx = /0 then break;

31 Result← eval(t,Ctx′,prot);

32 return Result;Algorithm 2: PAAT Full Evaluation.

Thereby, e is either an axis with node test, a predicate, or an ex-traction marker, leading to the following case distinction: (1) Foraxis navigation, starting at line 6, we obtain a new context set Ctxwith Raxis and Rnodes and evaluate t recursively on each c′ ∈ Ctx. Ifprot is set or c′ is not the last tuple in the iteration, the current pagemust be protected since it is needed subsequently (line 9). (2) Inline 11, we deal with predicate expressions [q]. We evaluate q witheval, since q may contain actions or Kleene stars. Since extractedrecords are structured according to the nesting of extraction mark-ers in predicates, the last sibling match c.l becomes the new parentmatch. (3) In line 14, for extraction markers, we output the markednode, and—if the marker computes a value—output the evaluationof v of the marked node. After outputing the match OUT(c.n,M),we set the last sibling in the context accordingly.

We denote the sub-language of OXPath which contains no ac-tions or Kleene stars at all as OXPathbasic. For OXPathbasic expres-sions, the only recursive call to eval, can be replaced with calls toeval− to obtain a self-contained algorithm for OXPathbasic.

4.2 PAAT Full EvaluationIn Algorithm 2, we show the procedure eval for full OXPath.

Essentially, eval deals with three cases: Actions, Kleene stars con-taining actions in the main path, and other Kleene expressions. Inthe first two cases, eval performs a depth-first traversal in the pagetree, the latter is solved with a set-based approach, as all resultingnodes are on the same page. As invariance throughout the recur-sion, the input set Ctx always contains nodes from a single page.

In line 2, we delegate simple Expr to eval−. Next, we split Exprinto three parts: h is the maximum prefix which is simple, e is eitheran action or a Kleene star, and t is the remaining expression. The

prefix h is evaluated with eval−, and the result is returned if empty.(1) The first main case deals with actions, starting at line 8.

Roughly, we iterate over all c in Ctx, obtain c′ via the action, andevaluate t on c′. This is not exponential, since actions cannot betraversed twice from the same node (no reverse traversal). Morespecifically, we protect the page to perform the action upon, ei-ther if the input flag prot is set, or if c is not last in the iteration(line 10). If the page is protected, getPage opens a new buffer forthe page accessed through the action and returns the tuple c′ refer-ring to the root of the new page. If not, the page is replaced and allmemoization information of eval− for the old page is freed. If theaction is contextual and did not change the page, we stay at c andavoid evaluating the action free prefix AFP(action,c.n). Otherwise,if the page has been modified, we apply AFP(action,c.n) to obtainc′. Either way, we evaluate t recursively on c′, descending one stepfurther in the depth-first traversal of the page tree (line 15). We setthe protection to false, since we free the page and all memoizedinformation on this page after the invocation in any case (line 16).(2) Below line 18, we handle Kleene star expressions with actionson the main path. If the upper iteration bound w has reached 0,we evaluate t and exit (line 19). Otherwise, we also evaluate t,but only if the lower bound v has reached 0 (line 21), then in-stantiate one copy of the Kleene repeated path, and decrement thebounds (line 22) to evaluate the resulting expression Expr′ recur-sively. (3) Finally we deal with the remaining Kleene star expres-sions (line 25). They do not leave the current page; thus, exponen-tially many paths may reach the same node. We apply path repeat-edly until Ctx becomes empty or the upper bound w is reached. Tovisit tuples only once, we store in Ctx′ all visited tuples and keep inCtx only the new ones (line 29). Finally we apply t to Ctx′ (line 31).

4.3 Analysis of PAATThe proofs of the following results are given in Appendix E.PROPOSITION 2. Evaluating an OXPathbasic expression on a

context tuple c using eval− takes O(n6 · q2) time and O(n5 · q2)space where q is the size of the expression and n the number ofnodes in the page of c.

Evaluating OXPath expressions containing unbounded Kleenestars and actions may cause non-termination, if there are infinitelymany paths in the page tree matched by the Kleene star expression.Such cases occur (1) if there are cycles in the web graph matchedby the expression; (2) if web sites serve (different) answers for aninfinite number of URIs, e.g., by including its page impressions.

We investigate the complexity of OXPath−Kleene which containsno Kleene stars, OXPathbounded with only bounded Kleene stars,and full OXPath over page trees of bounded depth.

PROPOSITION 3. Evaluating an OXPath−Kleene expression us-ing eval takes O((p · n)6 · q3) time and O(n5 · q3) space with q asabove, n the maximum number of nodes in an accessed page, andp the number of such pages.

THEOREM 4 (FULL OXPATH). Evaluating a full OXPath ex-pression using eval takes O((p ·n)6 ·q3) time and O(n5 · (q+d)3)space where q, p, and n are as above, and d the depth of the ac-cessed page tree.

For OXPathbounded, we note that d is bounded by the product ofall Kleene star upper bounds and obtain the following result:

COROLLARY 5. Evaluating a fixed OXPath expression withoutunbounded Kleene stars takes only constantly many page buffers.

THEOREM 6 (MEMORY OPTIMALITY). There exists an OX-Path expression and a page tree for which every algorithm thatcomputes J ·KV without prior knowledge of the page tree requiresat least as many page buffers as PAAT—if no page is loaded twice.

Page 7: OXPath: A Language for Scalable, Memory-efficient Data ...€¦ · the web. Due to the web’s rapid growth, humans can no longer find all relevant data without automation. Indeed,

0

10

20

30

40

50

60

0 50 100 150 200 250 300 350 400 450 0

200

400

600

800

1000

1200m

em

ory

[M

B]

pages

time [sec]

memoryvisited pages

(a) Many pages.

0

50

100

150

200

0 2 4 6 8 10 12 0

20

40

60

80

100

120

140

160

mem

ory

[M

B]

#pages [1000] / #re

sults [100,0

00]

time [h]

memoryextracted matches

visited pages

(b) Millions of results.

0

10

20

30

40

50

10 20 30 40 50 60 0

20

40

60

80

100

mem

ory

[M

B]

#m

atc

hes [100] / #pages

time [sec]

memoryextracted matches

visited pages

(c) Large pages.Figure 5: Scaling OXPath: Memory, visited pages, and output size vs. time.

PAAT (2%)browser initialization (13%)page rendering (85%)

(a) Profiling OXPath (avg.).

0

10

20

30

40

50

D1 D2 D3 D4 D5

tim

e [

se

c]

query set

browser initialization

page rendering

PAAT evaluation

(b) Profiling OXPath (per query).

0

0.1

0.2

0.3

0.4

1 2 3 4 5 6

tim

e [

se

c]

#actions

absolutecontextual

(c) Contextual actions.

Figure 6: Profiling OXPath’s components.

5. EVALUATIONIn this section, we confirm the scaling behaviour of OXPath:(1) The theoretical complexity bounds from Section 4.3 are con-

firmed in several large scale extraction experiments in diverse set-tings, in particular the constant memory use even if extracting mil-lions of records from hundreds of thousands of web pages.

(2) We illustrate that OXPath’s evaluation is dominated by thebrowser rendering time, even for complex queries on small webpages. None of the extensions of OXPath (Section 1.2) significantlyaffects the scaling behaviour or the dominance of page rendering.

(3) In an extensive comparison with commercial or academicweb extraction tools, we show that OXPath outperforms previousapproaches by at least an order of magnitude, although they aremore limited. Where OXPath requires memory independent of thenumber of accessed pages, most others use linear memory.

Scaling: Millions of Results at Constant Memory. Wevalidate the complexity bounds for OXPath’s PAAT algorithm andillustrate its scaling behaviour by evaluating three classes of queriesthat require complex page buffering. Figure 5(a) shows the resultsof the first class, which searches for papers on “Seattle” in GoogleScholar and repeatedly clicks on the “Cited by” links of all resultsusing the Kleene star operator. The query descends to a depth of3 into the citation graph and is evaluated in under 9 minutes (93pages/min). Pages retrieved are linear w.r.t. time, but memory sizeremains constant even as the number of pages accessed increase.The jitter in memory use is due to the repeated ascends and de-scends of the citation hierarchy. Figure 5(b) shows the same testsimulating Google Scholar pages on our web server (to avoid overlytaxing Google’s servers). The number in brackets indicate how wescale the axes. OXPath extracts over 7 million pieces of data from140,000 pages in 12 hours (194 pages/min) with constant memory.

We conduct similar tests (repeatedly clicking on all links) fortasks with different characteristics: one on very large pages (Wiki-pedia) and one on pages with many results (product listings onGoogle) from different web shops. Figures 5(c) and 12 (in Ap-pendix E) again show that memory is constant w.r.t. to pages visitedeven for the much larger pages on Wikipedia and the often quitevisually-involved pages reached by the Google product search.

Profiling: Page Rendering is Dominant. We profile eachstage of OXPath’s evaluation performing five sets of queries on thefollowing web sites: apple.com (D1), diadem-project.info (D2),bing.com (D3), www.vldb.org/2011/ (D4), and the Seattle pageon Wikipedia (D5). On each, we click on all links and extract thehtml tag of each result page. Figures 6(a) and 6(b) show the totalaverage and the individual averages for each site, respectively. Forbing.com, the page rendering time and the number of links is verylow, and thus also the overall evaluation time. Wikipedia pages,on the other hand, are comparatively large and contain many links,thus the overall evaluation time is high.

The second experiment analyses the effect of OXPath’s actionson query evaluation, comparing time performance of contextualand absolute actions. Our test performs actions on pages that donot result in new page retrievals. Figure 6(c) shows the results withqueries containing one to six actions on Google’s advanced prod-uct search form. Contextual actions suffer a small but insignificantpenalty to evaluation time compared to their absolute equivalents.

Comparison: Order-of-Magnitude Improvement. OX-Path is compared to four commercial web extraction tools, WebContent Extractor (WCE), Lixto, Visual Web Ripper (VWR), theacademic web automation and extraction system Chickenfoot andthe open source extraction toolkit Web Harvest. Where the firstthree can express at least the same extraction tasks as OXPath (and,e.g., Lixto goes considerably beyond), Chickenfoot and Web Har-vest require scripted iteration and manual memory management formany extraction tasks, in particular where multi-way navigation isneeded. We do not consider tools such as CoScripter and iMacrosas they focus on automation only and offer no iterative constructsas required for extraction tasks. We also do not consider tools suchas RoadRunner [6] or XWRAP [10] as they work on single pagesand lack the ability to traverse to new web pages.

In contrast to OXPath, many of these tools cannot processscripted websites easily. Thus, we compare using an extractiontask on Google Scholar, which does not require scripted actions.On heavily scripted pages, the performance advantage of OXPathis even more pronounced. With each system, we navigate the cita-

Page 8: OXPath: A Language for Scalable, Memory-efficient Data ...€¦ · the web. Due to the web’s rapid growth, humans can no longer find all relevant data without automation. Indeed,

0

100

200

300

400

500

600

700

800

0 20 40 60 80 100 120 140

tim

e [

se

c]

#pages

OXPathWeb Content Extractor

LixtoVisual Web Ripper

Web HarvestChickenfoot

(a) Evaluation time, <150 p.

0

100

200

300

400

500

600

700

0 20 40 60 80 100 120 140

tim

e w

/o p

ag

e lo

ad

ing

[se

c]

#pages

OXPathWeb Content Extractor

LixtoVisual Web Ripper

Web HarvestChickenfoot

(b) Norm. evaluation time, <150 p.

0

200

400

600

800

1000

1200

1400

1600

0 100 200 300 400 500 600 700 800

tim

e w

/o p

ag

e lo

ad

ing

[se

c]

Number of pages

OXPathLixto

Web HarvestChickenfoot

(c) Norm. evaluation time, <850 p.

Figure 7: Comparison.

0

50

100

150

200

250

300

350

0 100 200 300 400 500 600 700 800

me

mo

ry [

MB

]

#pages

OXPathLixto

Web HarvestChickenfoot

Figure 8: Comparison: Memory (<850 p.).

tion graph to a depth of 3 for papers on “Seattle” (see Appendix F).We record evaluation time and memory for each system. In Fig-

ure 7(a), we show the (averaged) evaluation time for each system upto 150 pages. Though Chickenfoot and Web Harvest do not man-age page and browser state or do not render pages at all, OXPathstill outperforms them. Systems that manage state similar to OX-Path are between two and four times slower than OXPath even onthis small number of pages. Figure 7(b) show the evaluation timediscounting the page loading, cleaning, and rendering time. Thisallows for a more balanced comparison as the different browser en-gines or web cleaning approaches used in the systems affect theoverall runtime considerably. Figure 7(c) shows the same evalua-tion time up to 850 pages, but omits WCE and VWR as they werenot able to run these tests. Again both figures show a considerableadvantage for OXPath (at least one order of magnitude). Finally,Figure 8 illustrates the memory use of these systems. Again WCEand VWR are excluded, but they show a clear linear trend in mem-ory usage in the tests we were able to run. Among the systems inFigure 8, only Web Harvest comes close to the memory usage ofOXPath, which is not surprising as it does not render pages. Yet,even Web Harvest shows a clear linear trend. Chickenfoot exhibitsa constant memory use just as OXPath, though it uses about tentimes more memory in absolute terms. The constant memory isdue to Chickenfoot’s lack of support for multi-way navigation thatwe compensate by using the browser’s history. This forces reload-ing when a page is no longer cached and does not preserve pagestate, but requires only a single active DOM instance at any time.We also tried to simulate multi-way navigation in Chickenfoot, butthe resulting program was too slow for the tests shown here.

6. CONCLUSION AND FUTURE WORKTo the best of our knowledge, OXPath is the first web extrac-

tion system with strict memory guarantees, which reflect stronglyin our experimental evaluation. We believe that it can become animportant part of the toolset of developers interacting with the web.

We are committed to building a strong set of tools around OX-Path. We provide a visual generator for OXPath expressions and

a Java API based on JAXP. Some of the issues raised by OXPaththat we plan to address in future work are: (1) OXPath is amenableto significant optimization and a good target for automated gener-ation of web extraction programs. (2) Further, OXPath is perfectlysuited for highly parallel execution: Different bindings for the samevariable can be filled into forms in parallel. The effective parallelexecution of actions on context sets with many nodes is an openissue. (3) We plan to further investigate language features, such asmore expressive visual features and multi-property axes.

7. REFERENCES[1] G. O. Arocena and A. O. Mendelzon. WebOQL: Restructuring

documents, databases, and webs. In ICDE, 24–33, 1998.[2] R. Baumgartner, S. Flesca, and G. Gottlob. Visual web

information extraction with Lixto. In VLDB, 119–128, 2001.[3] M. Benedikt and C. Koch. XPath Leashed. CSUR,

41(1):3:1–3:54, 2007.[4] M. Bolin, M. Webber, P. Rha, T. Wilson, and R. C. Miller.

Automation and customization of rendered web pages. InUIST, 163–172, 2005.

[5] C.-H. Chang, M. Kayed, M. R. Girgis, and K. F. Shaalan. Asurvey of web information extraction systems. TKDE,1411–1428, 2006.

[6] V. Crescenzi and G. Mecca. Automatic information extractionfrom large websites. JACM, 51(5):731âAS779, 2004.

[7] G. Gottlob, C. Koch, and R. Pichler. Efficient algorithms forprocessing XPath queries. TODS, 30(2):444âAS491, 2005.

[8] G. Leshed, E. M. Haber, T. Matthews, and T. Lau. CoScripter:automating & sharing how-to knowledge in the enterprise. InCHI, 1719–1728, 2008.

[9] J. Lin, J. Wong, J. Nichols, A. Cypher, and T. A. Lau. End-user programming of mashups with Vegemite. In IUI, 97–106,2009.

[10] L. Liu, C. Pu, and W. Han. XWRAP: An XML-enabledwrapper construction system for web information sources. InICDE, 611–621, 2000.

[11] M. Liu and T. W. Ling. A rule-based query language forHTML. In DASFAA, 6–13, 2001.

[12] M. Marx. Conditional XPath. TODS, 30(4):929–959, 2005.[13] A. O. Mendelzon, G. A. Mihaila, and T. Milo. Querying the

World Wide Web. In DIS, 80–91, 1996.[14] A. Sahuguet and F. Azavant. Building light-weight wrappers

for legacy web data-sources using W4F. In VLDB, 738–741,1999.

[15] N. Sawa, A. Morishima, S. Sugimoto, and H. Kitagawa.Wraplet: Wrapping your web contents with a lightweightlanguage. In SITIS, 387–394, 2007.

[16] W. Shen, A. Doan, J. F. Naughton, and R. Ramakrishnan.Declarative information extraction using datalog withembedded extraction predicates. In VLDB, 1033–1044, 2007.

[17] J.-Y. Su, D.-J. Sun, I.-C. Wu, and L.-P. Chen. On design ofbrowser-oriented data extraction system and plug-ins. JMST,18(2):189–200, 2010.

Page 9: OXPath: A Language for Scalable, Memory-efficient Data ...€¦ · the web. Due to the web’s rapid growth, humans can no longer find all relevant data without automation. Indeed,

APPENDIXA. FURTHER EXAMPLESWWW Papers with their citations. We might want to ex-tract the most relevant papers of a scientific field together withother work citing them. Figure 11 shows the necessary user ac-tions: (1) Enter “world wide web” in the search field and (2) press“search”. From the result page we extract the title and authors ofa paper and (3) click on its “cited by” link. Having extracted theciting papers for that paper we do the same for all papers on thecurrent result page and (4) click on the “Next” to retrieve the nextresult page, where we continue with (3). For sake of space, weextract only the citing papers from the first page, but it is easy toextract all citing papers.

Figure 11 shows an OXPath solution for this extraction task:Lines 1–2 fill and submit the search form. Line 3 realizes the iter-ation over the set of result pages by repeatedly clicking the “Next”link. It uses the Kleene star borrowed from Regular XPath [12] toexpress this navigation. Lines 4–5 identify a result record and itsauthor and title, lines 6–8 navigate to the cited-by page and extractthe papers. The expression yields nested records of the followingshape:

<paper><title>The diameter of the world wide web</title><authors>R Albert, H Jeong, AL Barabasi</authors><cited_by><title>Emergence of scaling in random ... </title><authors>AL Barabasi...</authors></cited_by>...</paper>

Stock Quotes from Yahoo Finance. Figure 10 illustrates anOXPath expression extracting stock quotes from Yahoo Finance. Inparticular, note the use of optional predicates ([? ]) for conditionalextraction: If the change is formatted in red, it is prefixed with aminus, otherwise with a plus.

B. SYNTACTIC RESTRICTIONSAs extraction markers and actions have side effects, the grammar

in Figure 2 omits a few restrictions necessary to avoid undesirableinterplay of expression markers and actions with functions, pred-icates and sorting operators: (1) Actions and extraction markersmay not occur in extraction marker values, arguments of functionsand comparison operators, or in bracketed expressions. (2) Extrac-tion markers may only occur in a predicate, if there is a marker on

flightprice: airline:

London Seattle

{cli

ck /

}

{click /}

1 2

3

6

plane:

{click /}

54

doc("kayak.co.uk")//input#origin/{"London"}/following::input#destination/{"Seattle"}/following::input[@name=’Search’]/{click /}//*#stops1/{click }/following::*#stops2/{click /}//tbody.resultrow:<flight>[.//a.results_price:<price=(.)>][.//a.resultdetaillink/{click /}//*.flight_detailsExtra:<plane=(substring-before(.,’|’))>]

Figure 9: OXPath for Flights to Seattle

Searchthe Web

Yahoo! Finance Home Investing Your Money Finance News & Comment My Portfolios UK Activity International Funds Currencies Research & Education

Hot topics: FTSE100 FTSE250 All Share AIM Sector Indices Dow Jones Nasdaq Dax 30 EuroStoxx 50

Enter Company or Symbol UK Get Quote | Search Y! Finance Search the Web

MORE ON ^FTAS

QuotesSummary

ComponentsHistorical Prices

ChartsInteractive

Basic Chart

Technical Charts

News & InfoHeadlines

Components Get Components for: GO

1 - 50 of 627 | First | Previous | Next | LastCOMPONENTS FOR ^FTAS

Symbol Name Last Trade Change Volume

3IN.L 3I INFRASTRUCTURE 113.98 p 12:29PM 0.98 (0.87%) 256,927

888.L 888 HOLDINGS 44.50 p 12:12PM 4.25 (8.72%) 69,840

AAIF.L ABERDEEN ASIAN INC 162.75 p 12:46PM 0.74 (0.45%) 4,876

AAL.L ANGLO AMERICAN 2,878.00 p 12:55PM 65.50 (2.23%) 1,098,574

AAS.L ABERDEEN ASIAN SMLR 630.00 p 12:43PM 13.00 (2.11%) 27,241

ABD.L ABERDEEN NEW DAWN 874.25 p 12:09PM 13.25 (1.49%) 6,378

ABF.L ASSOCIAT BRIT FOODS 1,054.00 p 12:55PM 6.00 (0.57%) 240,449

ABG.L AFRICAN BARR GOLD 547.50 p 12:53PM 2.50 (0.45%) 93,806

ABR.L ABSOLUT RET TST-PRP 114.25 p 11:57AM 0.50 (0.44%) 157,234

ADM.L ADMIRAL GROUP 1,610.00 p 12:47PM 10.00 (0.62%) 94,912

ADMF.L ADV DEVE MKT GBP 474.00 p 12:15PM 5.50 (1.17%) 40,132

ADN.L ABERDEEN ASSET MGMT 174.20 p 12:42PM 0.50 (0.29%) 176,301

AEP.L ANGLO EAST PLANT 605.00 p 11:08AM 4.00 (0.67%) 4,187

AFR.L AFREN 130.20 p 12:51PM 2.20 (1.66%) 584,612

AGA.L AGA RANGEMASTER 88.60 p 12:09PM 0.60 (0.68%) 784

AGK.L AGGREKO 1,674.00 p 12:49PM 5.00 (0.30%) 196,880

ADVERTISEMENT

Yahoo! UK & Ireland My Yahoo! Mail Search

Welcome, Tim[Sign Out, My Account] Finance Home - Help

Wednesday,27 October 2010, 1:11pm - UK Markets close in 3 hours and 19 minutes.

FTSE ALL-SHARE (^FTAS) At 12:55PM : 2,939.02 14.85 (0.50%)

stock name:

{cli

ck /

}

change:

Searchthe Web

Yahoo! Finance Home Investing Your Money Finance News & Comment My Portfolios

12:29pm: 113.98 p 0.98 (0.87%)

More On 3IN.L

QuotesSummaryLast TradeHistorical Prices

ChartsInteractiveBasic ChartTechnical Chart

Technical AnalysisDetailed Data

News & InfoHeadlinesMessage Board

CompanyProfileDirector Dealings

Analyst CoverageAnalyst OpinionsAnalyst EstimatesUp/Downgrades

OwnershipMajor Holders

Last Trade: 113.98 pTrade Time: 12:29pm

Change: 0.98 (0.87%)

Prev Close: 113.00

Open: 113.30

Bid: 113.80

Ask: 114.00

1y Target Est: 120.00 p

Day's Range: 113.30 - 114.10

52wk Range: 106.00 - 115.00

Volume: 256,927

Avg Vol (3m): 592,965

Market Cap: 925.13M

P/E (ttm): 10.27 x

EPS (ttm): 11.10 p

Div & Yield: 5.50 (4.87%)

3I INFRASTRUCTURE ( LSE: 3IN.L / ISIN JE00B1RJLF86 ) Buy/Sell for £9.95

1d 5d 3m 6m 1y 2y 5ycustomise chart

Add 3IN.L to Your PortfolioDownload DataFinance updates on yourmobile

Headlines

HSBC close to sale of train firm - sourcesReuters MOLT (Fri, 8 Oct)

PRESS DIGEST - The Financial Times -Sept 17Other (Fri, 17 Sep)

» More Headlines for 3IN.LAdd 3IN.L Headlines to My Yahoo!

ADVERTISEMENT

UK International Activity Currency Converter Research & Education

Hot topics: FTSE 100 FTSE250 All Share AIM Sector Indices Dow Jones Nasdaq Dax 30 EuroStoxx 50

Enter company or symbol UK Get Quotes Finance Search

Yahoo! UK & Ireland My Yahoo! Mail Search

Welcome, Tim[Sign Out, My Account] Finance Home - Help

Check out our new charts

3I INFRASTRUCTURE (3IN.L)

FTSE 100 0.47%

{click /}cap:

doc("uk.finance.yahoo.com/q/cp?s=%5EFTAS")//table[tbody/tr/td[.=’Name’]]//tr[not(position() = 1)]:<stock>[td[2]:<name=string(.)>][td[4][? .[img/@alt=’Up’]:<change=concat(’+’,.)>]

[? .[img/@alt=’Down’]:<change=concat(’-’,.)>]]

Figure 10: OXPath for stock quotes

the path leading to the predicate and the last such marker does notextract a value. (3) Extraction markers that extract a value may notoccur outside a predicate.

In addition, due to OXPath operating over multiple HTML pagesrather than a single XML document, we also enforce the follow-ing: (4) The id function is replaced with the CSS-style # node-test.(5) OXPath does not allow sorting of node sets with nodes fromdifferent pages. To that end, bracketed expressions such as (//a |

//b) are not allowed. (6) OXPath expressions always start with anexplicit document access doc(uri).

C. FULL OXPATH SEMANTICSIn Section 2.2 we highlight where the OXPath semantics differs

from the one for XPath. Table 2 and 3 give the full OXPath valueand extraction semantics, though we omit for space reasons thoserules in the extraction semantics that recursively decompose theexpression. Ff , Bo and Uo are the semantic functions on the nodescorresponding to the functions and operators of XPath, extendedby the OXPath . and # node-tests and #= and .= operators. Forsimplicity, we disallow positional functions outside qualifiers.

The extraction semantics Jexpr KE (c) for OXPath in Table 3,takes context tuple c and extracts an output tree from the input pagetree. The semantics is straightforward except for extraction markersdiscussed below: For expressions with sub-expressions using a dif-ferent context, we retrieve the new context using the node seman-tics J KN and return the markers extracted by all sub-expressions.E1 and E3 show this case exemplary for paths and predicates. Forall other expressions not shown here, we just collect all extractionmarkers returned by their subexpressions (if there are any), regard-less of the (value) semantics of the involved expressions.

Extraction markers are treated in rule E7.1 and E7.2: In rule E7.1for markers without extracted values, a tuple RM(OUT(c′.n,M)) isextracted for each reached tuple c′ in C = JstepKN (c) and relatedto its parent match via Rchild(OUT(c′.n,M),c′.p). For markers withvalues, we evaluate additionally in rule E7.2 the value expression vfor each reached tuple c′: We take the value returned by JvKE (c

′)(which is a string or other scalar value) and add a child text-node toOUT(c′.n,M) with content JvKE (c

′) (i.e., is added to RJvKE(c′).

Page 10: OXPath: A Language for Scalable, Memory-efficient Data ...€¦ · the web. Due to the web’s rapid growth, humans can no longer find all relevant data without automation. Indeed,

V1 J pathKV (c) = J pathKN (c)

V2 J f ct(e1, . . . ,ek)KV (c) = Ff ct(Je1 KV (c), . . . ,Jek KV (c))

V3 JvalueKV (c) = value

V4 Je1 op e2 KV (c) = Bop(Je1 KV ,Je2 KV)

V5 Jop eKV (c) = Uop(JeKV)

N1 Jestep/pathKN (c) = {c′′ | c′ ∈ JestepKN (c)∧ c′′ ∈ JpathKN (c′)}

N2 Jaxis::nodesKN (c) = {〈n′,c.p,c.l〉 | Raxis(c.n,n′)∧n′ ∈ Rnodes}N3 Jstep[q]KN (c) = {c′ ∈ JstepKN (c) | JqKB (〈c′.n,c′.l,c′.l〉)}N4 Jstep±[qp]KN (c) = {c′ ∈ Jstep± KN (c) |C = Jstep± KN (c)∧

J REWRITE±(qp,C,c′)KB (〈c′.n,c′.l,c′.l〉)}N5 J{action /}KN (c) = {〈n′,c.p,c.l〉} with n′ such that Raction(c.n,n′)N6 J{action}KN (c) = J AFP(action,c.n)KN (J{action /}KN (c))

N7 Jstep : 〈M[= v]〉KN (c) = {〈c′.n,c′.p,OUT(c′.n,M)〉 | ∃c′ ∈ JstepKN (c)}N8 J(path)∗KN (c) = {cr | ∃r ≥ 0 ∀0≤ s≤ r :

cs+1 ∈ JpathKN (cs)∧ c0 = c}N9 J(path)∗{v,w}KN (c) = {cr | ∃v≤ r ≤ w ∀0≤ s≤ r :

cs+1 ∈ JpathKN (cs)∧ c0 = c}N10 J(expr)[qp]KN (c) = {c′ | c′ ∈C∧C = JexprKN (c)∧

J REWRITE+(qp,C,c′)KB (〈c′.n,c′.l,c′.l〉)}N11 Jdoc(uri)KN (c) = Fdoc(uri)

B1 Jexpr KB (c) = UBoolean(Jexpr KV (c))

Table 2: Value Semantics of OXPath

E1 Jestep/pathKE (c) = JestepKE (c)∪⋃

c′∈JestepKNJpathKE (c

′)

E3 Jstep[q]KE (c) = JstepKE (c)∪⋃c′∈JstepKN(c)

JqKE (〈c′.n,c′.l,c′.l〉)E7.1 Jstep : 〈M〉KE (c) = JstepKE (c)∪{RM(OUT(c′.n,M)),

Rchild(OUT(c′n,M),c′.p) | c′ ∈ JstepKN (c)

E7.2 Jstep : 〈M = v〉KE (c) = JstepKE (c)∪⋃

c′∈C JvKE (c′)∪

RM(OUT(c′.n,M)),Rchild(OUT(c′.n,M),c′.p),RJvKN(c′)(c

′′)),Rchild(c′′,OUT(c′.n,M)) |c′ ∈ JstepKN (c)∧ c′′ new output node

Table 3: Extraction Semantics of OXPath (partial)

Web Images Videos Maps News Shopping Gmail more ! [email protected]

world wide web Search Advanced Scholar Search

Scholar Articles and patents anytime include citations Create email alert

The diameter of the world wide webR Albert, H Jeong, AL Barabási - Arxiv preprint cond-mat/9907038, 1999 - arxiv.orgarXiv:cond-mat/9907038v2 [cond-mat.dis-nn] 10 Sep 1999 The diameter of the world wide webDespite its increasing role in communication, the world wide web (www) remains the leastcontrolled medium: any individual or institution can create websites with unrestricted ... Cited by 2497 - Related articles - View as HTML - All 52 versions

Self-similarity in World Wide Web traffic: evidence and possible causesME Crovella, A Bestavros - Networking, IEEE/ACM ", 2002 - ieeexplore.ieee.orgAbstract—Recently, the notion of self-similarity has been shown to apply to wide-area and local-area network traffic. In this paper, we show evidence that the subset of network traffic that is due to World Wide Web (WWW) transfers can show characteristics that are consistent ... Cited by 3122 - Related articles - BL Direct - All 56 versions

World-wide web: The information universeT Berners-Lee, R Cailliau, JF Groff, B " - Internet ", 2010 - emeraldinsight.comPurpose – The World-Wide Web (W 3 ) initiative is a practical project designed to bring a global information universe into existence using available technology. This paper seeks to describe the aims, data model, and protocols needed to implement the “web” and to compare them ... Cited by 1919 - Related articles - Check Availability - BL Direct - All 26 versions

[BOOK] Information architecture for the world wide webL Rosenfeld, P Morville - 1998 - portal.acm.orgInformation Architecture for the World Wide Web is for webmasters, designers, and anyone else involved in building a web site. It's for novice web designers who, from the start, want to avoid the traps that result in poorly designed sites. It's for experienced web designers who have ... Cited by 1030 - Related articles - Library Search - All 19 versions

[BOOK] Weaving the Web: The original design and ultimate destiny of the World Wide Web inventor

title:

authors:

Web Images Videos Maps News Shopping Gmail more ! [email protected] | Scholar Preferences | My Account

world wide web Search Advanced Scholar Search

Scholar Articles and patents anytime include citations Create email alertResults 1 - 10 of about 2,340,000

Create email alert

Result Page: 1 2 3 4 5 6 7 8 9 10 Next

world wide web Search

Go to Google Home - About Google - About Google Scholar

©2010 Google

The diameter of the world wide webR Albert, H Jeong, AL Barabási - Arxiv preprint cond-mat/9907038, 1999 - arxiv.orgarXiv:cond-mat/9907038v2 [cond-mat.dis-nn] 10 Sep 1999 The diameter of the world wide webDespite its increasing role in communication, the world wide web (www) remains the leastcontrolled medium: any individual or institution can create websites with unrestricted ... Cited by 2497 - Related articles - View as HTML - All 52 versions

[PDF] from arxiv.org

Self-similarity in World Wide Web traffic: evidence and possible causesME Crovella, A Bestavros - Networking, IEEE/ACM ", 2002 - ieeexplore.ieee.orgAbstract—Recently, the notion of self-similarity has been shown to apply to wide-area and local-area network traffic. In this paper, we show evidence that the subset of network traffic that is due to World Wide Web (WWW) transfers can show characteristics that are consistent ... Cited by 3122 - Related articles - BL Direct - All 56 versions

[PDF] from psu.eduFind it @ Oxford

World-wide web: The information universeT Berners-Lee, R Cailliau, JF Groff, B " - Internet ", 2010 - emeraldinsight.comPurpose – The World-Wide Web (W 3 ) initiative is a practical project designed to bring a global information universe into existence using available technology. This paper seeks to describe the aims, data model, and protocols needed to implement the “web” and to compare them ... Cited by 1919 - Related articles - Check Availability - BL Direct - All 26 versions

[PDF] from cern.ch

[BOOK] Information architecture for the world wide webL Rosenfeld, P Morville - 1998 - portal.acm.orgInformation Architecture for the World Wide Web is for webmasters, designers, and anyone else involved in building a web site. It's for novice web designers who, from the start, want to avoid the traps that result in poorly designed sites. It's for experienced web designers who have ... Cited by 1030 - Related articles - Library Search - All 19 versions

[HTML] from informationr.net

[BOOK] Weaving the Web: The original design and ultimate destiny of the World Wide Web by itsinventorT Berners-Lee, M Fischetti - 1999 - portal.acm.orgTim Berners-Lee, the inventor of the World Wide Web, has been hailed by Time magazine as one of the 100 greatest minds of this century. His creation has already changed the way people do business, entertain themselves, exchange ideas, and socialize with one another.. " ... Cited by 983 - Related articles - Library Search - All 13 versions

[PDF] Data preparation for mining world wide web browsing patternsR Cooley, B Mobasher, J Srivastava" - Knowledge and Information ", 1999 - Citeseer... The World Wide Web (WWW) continues to grow at an astounding rate in both the sheer volumeof traffic and the size and complexity of Web sites. The complex- ... Data Pr eparati on for MiningWo r ld WideWeb Br owsing Patt e rns ... T able 1. Common Characteristics of Web Pages ... Cited by 1267 - Related articles - View as HTML - Check Availability - All 38 versions

[PDF] from psu.edu

Consumer reactions to electronic shopping on the World Wide WebSL Jarvenpaa, PA Todd - International Journal of Electronic ", 1996 - portal.acm.orgMuch fascination and speculation surrounds the impact of the World Wide Web on consumer shopping behavior. At the same time, there is little empirical evidence underlying all this speculation. This article provides one such data set. It reports on factors that consumers ... Cited by 719 - Related articles - Check Availability - BL Direct - All 2 versions

Searching the world wide webS Lawrence, CL Giles - Science, 1998 - sciencemag.orgThe coverage and recency of the major World Wide Web search engines was analyzed, yielding some surprising results. The coverage of any one engine is significantly limited: No single engine indexes more than about one-third of the "indexable Web," the coverage of the six ... Cited by 1067 - Related articles - BL Direct - All 71 versions

[PDF] from psu.eduFind it @ Oxford

Distributions of surfers' paths through the world wide web: Empirical characterizationsPLT Pirolli, JE Pitkow - World Wide Web, 1999 - SpringerSurfing the World Wide Web (WWW) involves traversing hyperlink connections among documents. The ability to predict surfing patterns could solve many problems facing producers and consumers of WWW content. We analyzed WWW server logs for a WWW site, ... Cited by 118 - Related articles - BL Direct - All 18 versions

[PDF] from psu.eduFind it @ Oxford

Web mining: Information and pattern discovery on the world wide webR Cooley, B Mobasher, J Srivastava - ictai, 1997 - computer.orgAbstract Application of data mining techniques to the World Wide Web, referredto as Web mining, has been the focus of sever al recent research projects and pap ers. However, there is no established vo cabulary, leading to confusion when comparing r esearch efforts.The ... Cited by 861 - Related articles - Check Availability - All 55 versions

[PDF] from psu.edu

{click /}

{cli

ck /

}

paper

Web Images Videos Maps News Shopping Gmail more ! [email protected] |

Search Advanced Scholar Search

Search within articles citing Albert: The diameter of the world wide web

Scholar All anytime include citations Create email alert

Emergence of scaling in random networksAL Barabási, R Albert - Science, 1999 - sciencemag.orgPage 1. DOI: 10.1126/science.286.5439.509 , 509 (1999); 286 Science et al. Albert-LászlóBarabási, Emergence of Scaling in Random Networks This copy is for your personal,non-commercial use only. . clicking here colleagues, clients, or customers by ... Cited by 8027 - Related articles - BL Direct - All 57 versions

[PDF] The structure and function of complex networksMEJ Newman - Arxiv preprint cond-mat/0303516, 2003 - CiteseerInspired by empirical studies of networked systems such as the Internet, social networks, and bio- logical networks, researchers have in recent years developed a variety of techniques and models to help us understand or predict the behavior of these systems. Here we review ... Cited by 4908 - Related articles - View as HTML - BL Direct - All 140 versions

[PDF] Surfing the p53 networkB Vogelstein, D Lane, AJ Levine" - Nature, 2000 - cmbi.bjmu.cnTumour-suppressor genes are needed to keep cells under control. Just as a car's brakes regulate its speed, properly func- tioning tumour-suppressor genes act as brakes to the cycle of cell growth, DNA repli- cation and division into two new cells. When these genes fail to ... Cited by 2928 - Related articles - View as HTML - All 10 versions

[CITATION] Evolution of networksSN Dorogovtsev, JFF Mendes - Advances in Physics, 2002 - Taylor & FrancisCited by 2846 - Related articles - Library Search - BL Direct - All 54 versions

The large-scale organization of metabolic networksH Jeong, B Tombor, R Albert, ZN Oltvai, AL Barabási - Nature, 2000 - nature.comIn a cell or microorganism, the processes that generate mass, energy, information transfer and cell-fate specification are seamlessly integrated through a complex network of cellular constituents and reactions 1 . However, despite the key role of these networks in sustaining cellular ...

cited_by

title:

authors:

world wide web 1 2

4+

3

doc("scholar.google.com")/descendant::field()[1]/{"world. . . "}Ê/following::field()[1]/{click/}Ë/(//a[contains(string(.),’Next’)]/{click/})*Í//div.gs_r:<paper>[.//h3:<title=string(.)>][.//*.gs_a:<authors=substring-before(.,’ - ’)>][.//a[.#=’Cited by’]/{click /}Ì//div.gs_r:<cited_by>[.//h3:<title=.>]

[.//*.gs_a:<authors=substring-before(.,’ - ’)>]]

Figure 11: Finding an OXPath through Google Scholar

D. OXPATH PROPERTIESNo sorting across Pages. OXPath avoids sorting context setswhich contain nodes from different pages, since it is unclear how toorder nodes from different pages, without first retrieving (and thusbuffering) those pages.

PROPOSITION 7 (NO NODE SORTING ACROSS PAGES). Theevaluation of an OXPath expression never requires sorting contextsets which contain nodes from different pages.

PROOF. OXPath requires sorting only for positional qualifiersin Rule N4 and bracketed expressions in Rule N11 (Appendix C).In both cases, the function REWRITE±(q±,C,c′) sorts the tuples inthe context set C and determines the position of c′ within C. Thus,it suffices to show, that C in Rule N4 and N11 never contains nodesfrom different pages.

This holds in N4, since C = JstepKV (c) is computed from a sin-gle axis navigation axis :: nodes (N2) and a sequence of (positional)qualifiers (N3 and N4) and markers (N7). Since N4 always resultsin a context set with nodes from the same page, and since N2-N47can only remove nodes from the context set, C must contain nodesfrom a single page only.

This holds in N11 as bracketed expressions may not contain ac-tions by Appendix B and neither N2–N4 nor N7–N10 return nodesfrom a different page than the context node, if no nested actionsoccur in the expressions.

Streaming Extraction Tuples. OXPath’s semantics does notrequire any further processing on result tuples, but allows them tobe streamed out as they are extracted. Extracted tuples are nevermodified, deleted, or re-accessed again.

PROPOSITION 8 (NO OUTPUT BUFFER). The evaluation ofOXPath expressions requires no output buffer.

PROOF. Only rules E7.1 and E7.2 in Table 3 create output tu-ples. Each created tuple is unique, as OUT is injective. Further-more, when the tuples are created the parent output nodes areknown by construction and thus no buffering is needed to createthe proper tuples in Rchild.

All other rules in the extraction semantics only collect the tuplesreturned by their sub-expressions. Since no duplicate tuples arecreated this collection does require no buffering.

Intuitively, this holds as the structure of the output tree reflectsthe structure of the OXPath expression and thus parent nodes arealways created before their children nodes.

Consider the OXPath expression expr[p1][p2]. If p1 containsextraction expressions, the extracted tuples are returned whetherp2 matches or not.

Requiring that tuples extracted by p1 are returned only if p2matches, would require an unbounded buffer, as the visited pagesand extracted results for p1 are both unbounded. Furthermore, wecan achieve the same effect in the existing OXPath semantics atthe cost of an increased query size (and thus evaluation time): TheOXPath expression expr[p′2][p1][p2] where p′2 is obtained from p2by removing all extraction markers. p′2 matches if and only if p2matches, as extraction markers do not affect matching, and tuplesextracted by p1 are only returned if p2 matches.

Complexity. We offer a proof for the theorem stated in Sec-tion 2.3.

PROOF OF THEOREM 1. Data Complexity: From all extensionsto XPath, only the Kleene star causes an increase in complexity:Actions are assumed to take constant time, extraction markers do

Page 11: OXPath: A Language for Scalable, Memory-efficient Data ...€¦ · the web. Due to the web’s rapid growth, humans can no longer find all relevant data without automation. Indeed,

not require additional memory as they are streamed out, and theadditional axis does not introduce further complexity.

XPath 1.0 without string concatenation and multiplication hasdata complexity LOGSPACE [3]. Each Kleene star expression canbe realized as transitive, reflexive closure of the Kleene star re-peated expression, therefore we arrive at NLOGSPACE data com-plexity for OXPath without string concatenation and multiplication.

Combined Complexity: PTIME-hardness follows immediatelyfrom the PTIME-hardness for XPath query evaluation [3].

To evaluate an OXPath query, we process query left to right.Since evaluating each subexpression requires at most polynomialtime, our overall evaluation runs in polynomial time as well.

If a subexpression is an XPath expression which makes no useof our extensions except for one of the additional axes, we rely onone of the known polynomial time algorithms for XPath [7]. Ifthe expression is a marker, we stream out the extracted matching,which is in polynomial time, too, and actions are assumed to takeconstant time.

The only remaining case is the Kleene star: If the Kleene star re-peated expression contains a non-nested action, we know that eachiteration of the repeated expression leads to a new page. Conse-quently, there are at most input size many iterations. If the Kleenestar does not contain an action it is an ordinary Regular XPath ex-pression which can be evaluated in polynomial time [12].

E. PROOFS FOR PAAT ALGORITHMPROOF OF PROPOSITION 2. We call an output node o gener-

ated by a marker M and input node n if OUT(n,M) = o. n and Muniquely identify o among all output nodes as OUT is injective. >is generated by any marker and input node. A sub-expression e ofExpr has nesting level i in Expr, if it is nested inside i predicates inExpr.

In OXPathbasic, eval− is called for a sub-expression e only withcontext tuples (i, p,s) for which it holds that there is a marker P anda marker S, such that all the parent output nodes p are generated byP and all the sibling output nodes s by S. P and S can be staticallydetermined for each e: P is the first extraction marker with a lowernesting level on the path from e to the start of the expression. S isthe first extraction marker with the same nesting level on that path.

Therefore, each context node i occurs with at most n differentparent output nodes and n different sibling nodes, and for each sub-expression the number of context tuples is bounded by O(n3). Thesize l of the lookup table Lookup is thus bounded by O(q ·n4).

As already remarked, for OXPathbasic, eval just passes the eval-uation through to eval− and does not affect the complexity.

The overall complexity of OXPathbasic is bounded by O(l ·o ·q)where o is the time for evaluating OXPath functions, extractionmarkers, and node-tests (see F , B, and U in Appendix C). o isbound by O(n2 · q) for all operations and functions in XPath, seeTheorem 6.6 in [7], and extraction markers do not affect this com-plexity. This includes the cost for sorting needed in evaluating apositional predicate (which is avoided in [7] by extending the con-text tuple) at O(n · logn).

Thus the overall time is bound by O(n6 ·q2).For space, we have a bound of O(l ·v ·q) where l is as above and

v is the maximum size of a value. Again from Theorem 6.6 [7] wehave a bound of O(n · q) for v and thus an overall space bound ofO(n5 ·q2).

It is worth noting that this is an increase of a multiplicative factorof n2 over full XPath in both cases. That is due to the fact that in ourcase there is no functional dependency from the context tuple to theresult value and that our context tuples are increased by the marker

0

10

20

30

40

50

60

70

50 100 150 200 250 300 350 400 450 0

1000

2000

3000

4000

5000

6000

7000

me

mo

ry [

MB

]

#m

atc

he

s [

10

]/ #

pa

ge

s

time [sec]

memoryextracted matches

visited pages

Figure 12: Scaling OXPath: many results

information. We can, however, omit the parent node (as we allowaccess to positional information only in positional predicates).

PROOF FOR PROPOSITION 3. OXPath−Kleene includes actionsand thus Expr may access more than one page. However, the pagetree (see Section 2.2) is navigated one page at a time by eval: Anew page is only reached by an action in Line 11. For each suchpage the remaining expression is evaluated and then all lookup en-tries for that page are deleted (Line 16).

This buffer management is implemented in eval by means of theprot flag and the three page management functions doc(), getPage,and FreeMem: FreeMem(expr) deletes all buffers associated withexpr. doc(uri,prot) returns the root node of the page with the givenuri. If prot is false, it replaces the previous page (in the underly-ing browser) and deletes memorized results for the previous page.getPage(node,action,prot) executes action on the given node. Ifprot is true, it first clones the page in the browser before executingthe action, otherwise it uses the same browser window and deletesall memoized results for the old page.

Thus, we traverse the page tree in a depth-first fashion. As thereis no repeated traversal of actions in a Kleene star free expression,we can traverse at most to a depth of O(q) (if each step in Expr isan action) into the page tree.

Therefore, the lookup table in eval− is bounded by O(n4 · q ·min(q,d)) = l.

Overall in the evaluation up to O((p · n)4 · q) = t context tu-ples are created and the complexity of OXPathbasic is bounded byO(t · oa · q) where o is the time for evaluating OXPath functions,extraction markers, and node-tests, but now also actions. If onlyabsolute actions occur oa is bounded by O(n2 · q) (see proof ofProposition 2) as actions do not affect the complexity as for eachcontext node a single page root is returned in constant time. Thisassumes that action execution is constant. Though action executionmay require the browser engine to evaluate an arbitrary, potentiallynon-terminating Javascript program, in practice action execution isalmost always near instant. Thus for OXPath−Kleene without con-textual actions we obtain O((p · n)6 · q2) as time and O(n5 · q3) asspace limit in analogy to the case for basic OXPath. For the spacebound it is worth noting that the size of values remains the same asin basic OXPath (and XPath) as actions may not occur in parame-ters of functions.

For contextual actions, the same observation as for absolute ac-tions applies except that we also reevaluate the action-free prefixon the retrieved page, if it has been modified. In the worst-case, ev-ery step in Expr carries a contextual action and the action-free pre-

Page 12: OXPath: A Language for Scalable, Memory-efficient Data ...€¦ · the web. Due to the web’s rapid growth, humans can no longer find all relevant data without automation. Indeed,

0

5

10

15

20

25

30

35

0 100 200 300 400 500 600 700 800

tim

e [m

in]

#pages

OXPathLixto

Web HarvestChickenfoot

Figure 13: Comparison: Absolute evaluation time, <850 p.

fix is always O(q). Then we evaluate the action-free prefix O(q)times and the overall time spent evaluating action-free prefixes isbounded by O(n6 ·q3) and space by O(n5 ·q2).

Thus the overall time is bound by O((p · n)6 · q2 + n6 · q3) ≤O((p · n)6 · q3) and the space bound remains the same as for ab-solute actions.

PROOF OF THEOREM 4. If we also consider the boundedKleene star, we have to observe that the number of pages bufferedat one time is no longer necessarily bound by the size of the ex-pression, as we do not know in advance how often a Kleene starthat contains an expression that navigates to a new page matches.Rather the number of buffered pages is bound by the maximumlevel d of any accessed page in the page tree for Expr. If d < ∞,the above complexity applies: The time complexity remains un-changed from OXPath−Kleene, but the space complexity increases,as the number of memorization tables depends on the query sizeand the maximum number of successful expansion steps for aKleene star expression, which is bounded by d. Again the size ofvalues remains the same as actions are not allowed in parameters offunctions and operators and Kleene stars without contained actionsyield at most n nodes and thus, with Theorem 6.6 [7], values of atmost O(n ·q) size.

PROOF OF COROLLARY 5. For expressions with unboundedKleene stars the length of the paths in the navigation tree from theroot that can be reached by an OXPath expression is bounded bythe expression. Thus, the corollary follows immediately from The-orem 3.

PROOF OF THEOREM 6. Consider the series of expressionsed = doc(w)rd with rd = //*/{action}rd−1 for d > 1 andr1 = self::*. Assume further that in the page tree of the expres-sions ed each page has at least two nodes with an action that leadsagain to another page of this form. ed executes {action} on allnodes of page w, and continues recursively from all pages thusreached. It returns the roots of the pages finally reached.

When we evaluate ed with PAAT, we access the page tree up toa depth of d and use exactly d page buffers. This holds, since theaccessed page tree has at least two branches at each page.

Any other algorithm A must load the leaves of the accessed pagetree of PAAT as these nodes are the result of evaluating Jed KV. To

visit such a leaf node l of the accessed page tree, we have to loadits parent p first, because without prior knowledge all children of pare only accessible by performing {action} on the respective nodein p. Thus, A must have loaded all d− 1 ancestors of l to finallyaccess l. Assume that l is the first leaf reached by A. Then, A mustbuffer all d−1 ancestors in addition to l, because for each ancestorof l there are further children to be visited.

F. COMPARATIVE EVALUATIONFor the comparative evaluation we implement the following OX-

Path expression in the other systems:doc("scholar.google.com")/descendant::field()[1]/{"Seattle"}

/following::field()[1]/{click /}/(//a[string(.)#="Cited by"]/{click/})*{0,3}

An equivalent Web Harvest program uses 54 lines(seediadem-project.info/oxpath, a Chickenfoot script uses 27(see below). The other tools use visual interfaces.

Chickenfoot Script for Google Scholar

go("http://scholar.google.co.uk/")// For the first two fields, Chickenfoot’s label heuristics fail and

thus// we need to select the fields using XPathclick(new XPath("/HTML[1]/BODY[1]/CENTER[1]/FORM[1]

/TABLE[1]/TBODY[1]/TR[1]/TD[2]/INPUT[1]"))enter(new XPath("/HTML[1]/BODY[1]/CENTER[1]/FORM[1]/

TABLE[1]/TBODY[1]/TR[1]/TD[2]/INPUT[1]"),"world wideweb")

click("Search button")

m=find("Cited by");for (i=1; i<=m.count;i++) {click(new XPath("/descendant::a[contains(.,\"Cited by\")][" + i

+"]"));n=find("Cited by");v=find(new XPath("//h3"));output(v);for (j=1; j<=n.count;j++) {click(new XPath("/descendant::a[contains(.,\"Cited by\")][" + j

+"]"));z=find(new XPath("//h3"));output(z);o=find("Cited by");for (k=1; k<=o.count;k++) {click(new XPath("/descendant::a[contains(.,\"Cited by\")][" + j

+"]"));zz=find(new XPath("//h3"));output(zz);back();

}back();

}back();

}

AcknowledgementsThe research leading to these results has received funding from theEuropean Research Council under the European Community’s Sev-enth Framework Programme (FP7/2007–2013) / ERC grant agree-ment no. 246858. This work was carried out in the wider contextof the networking programme FoX – Foundations of XML, FET-Open grant agreement number FP7-ICT-233599, of which OxfordUniversity is a partner. The views expressed in this article are solelythose of the authors.