Enabling web browsers to augment web sites’ filtering and sorting functionalities
David Huynh · Rob Miller · David Karger
MIT Computer Science & Artificial Intelligence Laboratory
UIST 2006 · Montreux, Switzerland

Automatic web content scraping (2003 ― now)

1. Zhai, Y., and B. Liu. Web data extraction based on partial tree alignment. WWW 2005.

2. Hogue, A. and D. Karger. Thresher: automating the unwrapping of semantic content from the World Wide Web. WWW 2005.

3. Reis, D.C., P.B. Golgher, A.S. Silva, and A.F. Laender. Automatic Web news extraction using tree edit distance. WWW 2004.

4. Lerman, K., L. Getoor, S. Minton, and C. Knoblock. Using the structure of Web sites for automatic segmentation of tables. SIGMOD 2004.

5. Ramaswamy, L., et al. Automatic detection of fragments in dynamically generated web pages. WWW 2004.

6. Wang, J.-Y., and F. Lochovsky. Data extraction and label assignment for Web databases. WWW 2003.

7. Arasu, A. and H. Garcia-Molina. Extracting structured data from Web pages. SIGMOD 2003.

8. Liu, B., R. Grossman, and Y. Zhai. Mining data records in Web pages. SIGKDD 2003.

… but no one has tried to put …

Automatic structured web content scraping technologies

in the hands of

end-users

… let’s run through a real task …

Paperback books published in 2005 or later by John Grisham on Amazon

… that was a demo of putting …

Automatic structured web content scraping technologies

in the hands of

end-users

Sifter browser extension

Outline

• Motivations
  1. User Interface Design
    • Extraction
    • Augmentation
  2. Extraction Algorithm
• Evaluations
  1. Extraction Algorithm
  2. User Interface Design
• Conclusions

Motivations

• Not all web sites are designed based on task analysis and user analysis.
  • Faceted browsing?
  • Maps view?
  • Calendar view?
• Features are not implemented consistently across sites.
  • Web browsers can provide a unified sorting/filtering interface.
• Not all users have exactly the same needs.
  • No site can ever design for all users.
  • Each web browser can tailor experience to its owner.

User Interface Design – Extraction

• Web content extraction is a system precondition poorly understood by users.
  • If it doesn’t let me do this,…
  • If the web site understands that this is the original price ( $8.99 ),…
  • If I can see that this is a date (“last Christmas”),…

User Interface Design – Extraction

• Extraction is lengthy and error-prone.
  • We explore UI potentials even in the face of fragile extraction.
  • This lets us know which aspects of extraction should be improved first, and in which ways.
• We minimize the steps required to kick-start extraction.
• But we give the user a chance to make corrections early.

UI Design - Extraction

1st click

preview of results

controls for making correction

2nd click if all goes well

User Interface Design - Augmentation

• Novelty
  • Presentation of data remains unchanged
    • … except for a few asterisks.
    • Presentation might be well-designed with domain-specific knowledge, and worth keeping as-is.
    • Semantics of the data are in the presentation.
    • We want to maintain visual context.
  • Filtering and sorting are supported without resorting to field names.

User Interface Design - Augmentation

• By keeping the original visual presentation of the data, and then applying automatic content extraction technology, we can provide additional functionalities without needing, trying, or pretending to understand the semantics of the data.

format? binding? medium? who cares?!

… ssshhhh …

Semantics is Overrated

Extraction Algorithm

Detection of

1. Items of interest

2. Subsequent pages

3. Fields within items

Extraction Algorithm - Assumptions

1. Items occupy most of the page area.

2. Each item contains links.

Find THE set of similar links whose outer containers occupy the largest page area compared to other sets of links.
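The selection rule above can be sketched roughly as follows. This is a minimal illustration, not Sifter's actual code: it assumes the links have already been grouped into sets of similar links and that the rendered size of each link's outer container is known. The `Box` class, the `pick_item_links` function, and all sizes and paths in the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Rendered size of one link's outer container, in pixels (hypothetical input)."""
    width: float
    height: float

    @property
    def area(self) -> float:
        return self.width * self.height


def pick_item_links(link_sets: dict[str, list[Box]]) -> str:
    """Return the key of the link set whose containers cover the largest total area."""
    return max(link_sets, key=lambda key: sum(box.area for box in link_sets[key]))


# Made-up sample data: two wide result rows versus eight small menu entries.
link_sets = {
    "BODY/TABLE/TR/TD/DIV/A": [Box(600, 120), Box(600, 120)],
    "BODY/DIV/UL/LI/A": [Box(150, 20)] * 8,
}
chosen = pick_item_links(link_sets)
print(chosen)
```

The sketch simply sums container areas and ignores overlap between containers, so it only approximates "occupies the largest page area".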

Example page structure (each TR is one item; one of its two TD cells holds a DIV wrapping the link A):

BODY
  TABLE
    TR - item 1
      TD  TD
        DIV
          A
    TR - item 2
      TD  TD
        DIV
          A

Starting from a link, its xpath is built up one ancestor at a time:

A
DIV/A
TD/DIV/A
TR/TD/DIV/A
TABLE/TR/TD/DIV/A
BODY/TABLE/TR/TD/DIV/A

Found similar links!

The xpath is then extended upward from the links:

BODY/TABLE/TR/TD/DIV/A/.. (the DIVs)
BODY/TABLE/TR/TD/DIV/A/../.. (the TDs)
BODY/TABLE/TR/TD/DIV/A/../../.. (the TRs)
BODY/TABLE/TR/TD/DIV/A/../../../.. (the TABLE)
BODY/TABLE/TR/TD/DIV/A/../../.. (back to the TRs)

Found one potential set of items!
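The walk shown on these slides can be approximated with a short sketch. It is an interpretation rather than the authors' implementation: it assumes that links count as "similar" when they share the same tag path, that the climb stops just before the links' ancestors merge into a single node, and that the DIV sits in the first TD. The function name `find_items` is hypothetical, and picking the largest group by link count is only a stand-in for the page-area criterion, which needs layout information.

```python
import xml.etree.ElementTree as ET

# Toy page mirroring the slide; placing the DIV in the first TD is an assumption.
PAGE = """
<BODY>
  <TABLE>
    <TR><TD><DIV><A>Item 1</A></DIV></TD><TD/></TR>
    <TR><TD><DIV><A>Item 2</A></DIV></TD><TD/></TR>
  </TABLE>
</BODY>
"""


def find_items(root):
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent = {child: node for node in root.iter() for child in node}

    def xpath(node):
        tags = []
        while node is not None:
            tags.append(node.tag)
            node = parent.get(node)
        return "/".join(reversed(tags))

    # Group links by tag path; links sharing a path are treated as similar.
    groups = {}
    for link in root.iter("A"):
        groups.setdefault(xpath(link), []).append(link)
    path, links = max(groups.items(), key=lambda kv: len(kv[1]))

    # Append "/.." while every link still has its own distinct ancestor.
    nodes, steps = links, 0
    while True:
        ups = [parent.get(node) for node in nodes]
        if None in ups or len(set(ups)) < len(links):
            break
        nodes, steps = ups, steps + 1
    return path + "/.." * steps, nodes


root = ET.fromstring(PAGE.strip())
path, items = find_items(root)
print(path)
print([item.tag for item in items])
```

On the toy page the climb passes the DIVs and TDs, stops at the TRs because one more step would collapse both links onto the single TABLE, and so ends on the same xpath as the last slide in the sequence.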

Extraction Algorithm – Subsequent page detection

• URL parameters
  • http://amazon.com/ ... ? ... &page=2& ...
  • http://amazon.com/ ... ? ... &page=3& ...
  • http://amazon.com/ ... ? ... &page=4& ...
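One way to exploit URL parameters like these is sketched below. The slide only names URL parameters as the signal, so the heuristic is an assumption: the paging parameter is taken to be the one that has several distinct integer values across links that are otherwise identical. The function name `find_page_parameter` and the example.com URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlsplit


def find_page_parameter(urls):
    """Guess which query parameter numbers the result pages (heuristic sketch)."""
    candidates = defaultdict(set)
    for url in urls:
        parts = urlsplit(url)
        params = parse_qsl(parts.query)
        for name, value in params:
            if not value.isdigit():
                continue
            # Everything except this parameter identifies the same listing.
            rest = tuple(sorted(p for p in params if p[0] != name))
            candidates[(parts.netloc, parts.path, name, rest)].add(int(value))
    best = max(candidates.items(), key=lambda kv: len(kv[1]), default=None)
    if best is None or len(best[1]) < 2:
        return None
    (_, _, name, _), pages = best
    return name, sorted(pages)


# Made-up links as they might appear on a results page.
urls = [
    "http://example.com/search?q=grisham&page=2&sort=price",
    "http://example.com/search?q=grisham&page=3&sort=price",
    "http://example.com/search?q=grisham&page=4&sort=price",
    "http://example.com/item?id=17",
]
result = find_page_parameter(urls)
print(result)
```

Once the parameter and its values are known, subsequent pages could be fetched by substituting other page numbers into the same URL, although sites that page by offsets or opaque tokens would need a different approach.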

Evaluations – Extraction algorithm

• Test conducted over 30 web sites:
  • Amazon, BestBuy, CNET Reviews, Froogle, Target, Walmart, …
• Item detection
  • Items on 27 / 30 collections can be identified by xpaths (in the remaining 3, items consist of sibling/cousin nodes)
  • … but only 24 / 27 were automatically detected
• Subsequent page detection
  • For 22 / 27 collections, subsequent pages could be identified.
  • For 19 / 22 collections, original numbers of items were recovered.
• Overall
  • 19 / 30 = 63% accuracy
  • We measure accuracy at the level of whole collections, not individual items.

Evaluations – User Interface Design

• Extraction algorithm is still fragile

• Formative evaluation of UI

• Is “web content extraction” too high a conceptual barrier?

• Is in-place sorting/filtering augmentation usable?
  • No field name – usable?

• Is such augmentation useful?

Evaluations – User Interface Design

• Task 1: Structured
  • This task lets subjects get familiar with the UI.
  • No specific help or tutorial is provided.
  • Subject follows a sequence of high-level instructions to ultimately perform a complex query.
    • sort by price
    • filter by date
  • Subject is given 5 min to perform a similar query using the web site.
• Task 2: Unstructured
  • Subject judges whether a sale of several products is good.

Evaluations – User Interface Design

• Task 1: Structured
  • 8/8 subjects completed the task using our system.
  • 5/8 … using the web site within 5 minutes.
    • 1/8 knew about Amazon’s Advanced Search.
  • All subjects were familiar with Amazon.
  • A unified filtering/sorting UI can be more usable than different UIs on different sites.
• Task 2: Unstructured
  • 7/8 subjects completed the task using our system.
  • 1 refused to complete the task.

Evaluations – UI Design

• Survey responses indicate
  • Our system is usable and useful
  • … while it offers advanced functionalities.

Conclusions

• In our work, we …
  • Preserve original presentation to leverage the semantics within it;
  • Provide filter/sort functionalities without field names;
  • Put automatic web content extraction technologies into the hands of end-users;
  • Show evidence that it’s usable and useful.
• For future work, we will focus on …
  • Error recovery;
  • Merging data from several sites.

More information

• http://simile.mit.edu/wiki2/Sifter
  • Firefox extension installation file
  • Open source code + build instructions
  • Links to video and user study data