17
1 Today Tools (Yves) Efficient Web Browsing on Hand Held Devices (Shrenik) Web Page Summarization using Click-through Data (Kathy) On the Summarization of Dynamically Introduced Data (Clement) Ocelot: A System for Summarizing Web Pages (Kathy)

1 Today Tools (Yves) Efficient Web Browsing on Hand Held Devices (Shrenik) Web Page Summarization using Click- through Data (Kathy) On the Summarization

  • View
    216

  • Download
    0

Embed Size (px)

Citation preview

Page 1: 1 Today  Tools (Yves)  Efficient Web Browsing on Hand Held Devices (Shrenik)  Web Page Summarization using Click- through Data (Kathy)  On the Summarization

1

Today

Tools (Yves) Efficient Web Browsing on Hand Held

Devices (Shrenik) Web Page Summarization using Click-

through Data (Kathy) On the Summarization of Dynamically

Introduced Data (Clement) Ocelot: A System for Summarizing Web

Pages (Kathy)

Page 2: 1 Today  Tools (Yves)  Efficient Web Browsing on Hand Held Devices (Shrenik)  Web Page Summarization using Click- through Data (Kathy)  On the Summarization

2

Past projects Ranking and recommending bars Clothing search engine Improving multilingual search Segmenting the bible by style Investigating bias in media reporting using bootstrapping Admatch: finding businesses with similar characteristics News event type clustering Query directed post-MT editing Retrieval of documents relating to different query senses

Page 3: 1 Today  Tools (Yves)  Efficient Web Browsing on Hand Held Devices (Shrenik)  Web Page Summarization using Click- through Data (Kathy)  On the Summarization

3

Web Page Summarization Using Clickthrough Data Extractive, generic summarization for

web-pages Clickthrough data

Could be collected from millions of users Queries that got users to this page

<u,q,p>: user, query, page clicked on

Assumption: Query word set may cover multiple topics in the web page

Problems: Clickthrough data is noisy

Page 4: 1 Today  Tools (Yves)  Efficient Web Browsing on Hand Held Devices (Shrenik)  Web Page Summarization using Click- through Data (Kathy)  On the Summarization

4

Experimental analysis of clickthrough data and summaries

One month of MSN clickthrough data Of 260,763 accessed web pages, 109,694

contain keyword meta-data 45.5% keywords occur in queries 13.1% of query words appear as keywords

Human summaries gathered of 90 web pages 58% of web pages contain query words 71.3% of summaries contain query words

Page 5: 1 Today  Tools (Yves)  Efficient Web Browsing on Hand Held Devices (Shrenik)  Web Page Summarization using Click- through Data (Kathy)  On the Summarization

5

Clickthrough data is important for summaries

Page 6: 1 Today  Tools (Yves)  Efficient Web Browsing on Hand Held Devices (Shrenik)  Web Page Summarization using Click- through Data (Kathy)  On the Summarization

6

Extractive summarization methods: words versus concepts Frequency (Luhn)

Frequency of terms in web page weighted against frequency of terms in query

Latent Semantic Analysis Singular value decomposition (SVD) of

termXdocument matrix Yields latent concepts Adapted here to weight query terms higher

Page 7: 1 Today  Tools (Yves)  Efficient Web Browsing on Hand Held Devices (Shrenik)  Web Page Summarization using Click- through Data (Kathy)  On the Summarization

7

When Clickthrough data is not enough

Thematic lexicon based on Dmoz: http://dmoz.org Provides term distributions per category Can be used when there are no query words for a page

Build the thematic lexicon For each category, construct TS = to all query terms used

to access any page in category Parents take the union of TS of its children Term weight = freq * ICF

Summarize: no click-through data, use term weights of its categories

Page 8: 1 Today  Tools (Yves)  Efficient Web Browsing on Hand Held Devices (Shrenik)  Web Page Summarization using Click- through Data (Kathy)  On the Summarization

8

Results

Latent Semantic Analysis (ALSA) better With Query words best

Page 9: 1 Today  Tools (Yves)  Efficient Web Browsing on Hand Held Devices (Shrenik)  Web Page Summarization using Click- through Data (Kathy)  On the Summarization

9

Questions

In LSA, does a sentence have to have a term which appeared in a query?

Interesting notion of data collection

Page 10: 1 Today  Tools (Yves)  Efficient Web Browsing on Hand Held Devices (Shrenik)  Web Page Summarization using Click- through Data (Kathy)  On the Summarization

10

Ocelot: A system for summarizing web pages Synthesizes/generates the words in the

summary Uses ODP (Demoz) as training/test Uses statistical models for generation Methods closely related to those in

statistical machine translation (MT)

Page 11: 1 Today  Tools (Yves)  Efficient Web Browsing on Hand Held Devices (Shrenik)  Web Page Summarization using Click- through Data (Kathy)  On the Summarization

11

Architecture

Content selection Select words according to their frequency Select words which do not occur in the summary

using semantic similarity

Ordering Ngram model of language

Search for the best sequence satisfying both content and ordering Viterbi search to find word that maximizes content

selection and ngram model

Page 12: 1 Today  Tools (Yves)  Efficient Web Browsing on Hand Held Devices (Shrenik)  Web Page Summarization using Click- through Data (Kathy)  On the Summarization

12

Models

Bag of words gisting: for each slot, draw a word from the bag

Plus new words: Draw a word, then replace with a similar word

Combine with tri-gram model

Page 13: 1 Today  Tools (Yves)  Efficient Web Browsing on Hand Held Devices (Shrenik)  Web Page Summarization using Click- through Data (Kathy)  On the Summarization

13

Data (ODP) Training: 103,064 summaries + links Test: 1046 summaries + links

Page 14: 1 Today  Tools (Yves)  Efficient Web Browsing on Hand Held Devices (Shrenik)  Web Page Summarization using Click- through Data (Kathy)  On the Summarization

14

Training: Word relatedness

Assume summary/gist and document are aligned

Each word in D corresponds to one word in G (gist), possibly the null word

If D contains m words and G contains n+1 words then (n+1)m alignments Use expectation maximization to choose

best alignment for D1|G1 pair then D2|G2 pair, etc.

Page 15: 1 Today  Tools (Yves)  Efficient Web Browsing on Hand Held Devices (Shrenik)  Web Page Summarization using Click- through Data (Kathy)  On the Summarization

15

Page 16: 1 Today  Tools (Yves)  Efficient Web Browsing on Hand Held Devices (Shrenik)  Web Page Summarization using Click- through Data (Kathy)  On the Summarization

16

Results using word overlap

Page 17: 1 Today  Tools (Yves)  Efficient Web Browsing on Hand Held Devices (Shrenik)  Web Page Summarization using Click- through Data (Kathy)  On the Summarization

17