User oriented text summarization

BY

Asef poormasoomi

Motivation

Summaries that are generic in nature do not cater to the user’s background and interests.

Results show that each person has a different perspective on the same text.

So a good summary should change in accordance with the preferences of its reader.

Motivation

Marcu-1997: found that the percent agreement of 13 judges over 5 texts from Scientific American is 71 percent.

Rath-1961 : found that extracts selected by four different human judges had only 25 percent overlap

Salton-1997 : found that most important 20 paragraphs extracted by 2 subjects have only 46 percent overlap

Users Feedback

Query History: the most widely used implicit user feedback at present.

http://www.google.com/psearch

Data Click: when a user clicks on a document, the document is considered to be of more interest to the user than other unclicked ones.

Attention Time: often referred to as display time or reading time.

Other types of implicit user feedback include display time, scrolling, annotation, bookmarking and printing behaviors.

ARTICLE1:

Generating Personalized Summaries Using Publicly Available Web Documents

2008 IEEE Chandan Kumar, Prasad Pingali, Vasudeva Varma

extract the personal information of the user using information available on the web

Generic Sentence Scoring

In general: compute the probability distribution over the words w appearing in the input D, p(w|D).

For each sentence S in the input, assign a weight equal to the average probability of the words in the sentence.
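The two steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming p(w|D) is the maximum-likelihood unigram estimate (word count divided by total tokens) and that sentences are already tokenized; the function names and toy data are hypothetical.

```python
from collections import Counter

def word_probabilities(tokens):
    """Assumed estimate: p(w|D) = count(w) / total number of tokens in D."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def generic_score(sentence_tokens, p_doc):
    """Weight of a sentence: average p(w|D) over its words."""
    if not sentence_tokens:
        return 0.0
    return sum(p_doc.get(w, 0.0) for w in sentence_tokens) / len(sentence_tokens)

# Toy input: two already-tokenized sentences (hypothetical data).
sentences = [["lab", "opens", "india"], ["lab", "research", "lab"]]
p_doc = word_probabilities([w for s in sentences for w in s])
scores = [generic_score(s, p_doc) for s in sentences]
```

The second sentence scores higher because it repeats the most frequent word of the input.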


Estimating User Background Model: a search engine is used to extract the personal information of the user from information available on the web.

The person’s full name is submitted to a search engine (the name is quoted with double quotation marks, such as "Albert Einstein").

The top n documents are retrieved. After stop-word removal and stemming, a unigram language model is learned on the extracted text content.

This model can be interpreted as the probability of a word w being related to the person’s profile U, p(w|U).



User Specific Sentence Scoring :

The term probabilities of the document set D, p(w|D), and of the user profile U, p(w|U), are merged using a linear weighted combination. The score of a sentence S for user u is given as:
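A linear weighted combination of this kind might look like the sketch below. It assumes the per-word mixture lam * p(w|D) + (1 - lam) * p(w|U) is averaged over the sentence, mirroring the generic scoring; the mixing weight lam, its default value, and all names are assumptions for illustration, not values taken from the paper.

```python
from collections import Counter

def unigram_model(tokens):
    """Maximum-likelihood unigram model over a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def user_score(sentence_tokens, p_doc, p_user, lam=0.5):
    """Average of lam * p(w|D) + (1 - lam) * p(w|U) over the sentence.
    lam is a hypothetical mixing weight."""
    if not sentence_tokens:
        return 0.0
    mixed = (lam * p_doc.get(w, 0.0) + (1 - lam) * p_user.get(w, 0.0)
             for w in sentence_tokens)
    return sum(mixed) / len(sentence_tokens)

# Toy document model and a toy profile for a user from the NLP domain.
p_doc = unigram_model(["lab", "security", "translation", "lab"])
p_user_nlp = unigram_model(["translation", "parsing", "translation", "corpus"])

s_nlp = ["translation", "lab"]
s_sec = ["security", "lab"]
```

With lam = 1.0 the score reduces to the generic one and both toy sentences tie; lowering lam lets the profile favor the sentence about translation.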


After sentence scoring, eliminate redundancy:

For redundancy identification, use the number of terms overlapping between the already generated summary and the new sentence being considered.

Sentences are arranged by chronological order between documents (i.e. based on the time stamp) and by order of occurrence within a document.
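The selection and ordering steps can be sketched as a greedy loop. This is only an illustration: the overlap threshold, the summary length limit, and the dictionary layout of a candidate sentence are assumptions, since the slides do not state them.

```python
def select_summary(candidates, max_overlap=2, max_sentences=3):
    """candidates: dicts with keys id, score, tokens, timestamp, position.
    max_overlap and max_sentences are hypothetical thresholds."""
    summary, summary_terms = [], set()
    for cand in sorted(candidates, key=lambda c: c["score"], reverse=True):
        if len(summary) == max_sentences:
            break
        overlap = len(set(cand["tokens"]) & summary_terms)
        if overlap > max_overlap:
            continue  # too many terms shared with the summary so far
        summary.append(cand)
        summary_terms |= set(cand["tokens"])
    # Order by document time stamp, then by position inside the document.
    return sorted(summary, key=lambda c: (c["timestamp"], c["position"]))

candidates = [
    {"id": "a", "score": 0.9, "tokens": ["microsoft", "lab", "india"],
     "timestamp": 2, "position": 0},
    {"id": "b", "score": 0.8, "tokens": ["microsoft", "lab", "india", "opens"],
     "timestamp": 1, "position": 3},
    {"id": "c", "score": 0.7, "tokens": ["cryptography", "group", "india"],
     "timestamp": 1, "position": 1},
]
ordered_ids = [c["id"] for c in select_summary(candidates)]
```

Sentence b is dropped because it shares three terms with the already selected sentence a, and the remaining sentences are emitted in time-stamp order.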



Example: the topic of summary generation is "Microsoft to open research lab in India".

8 articles published in different news sources form the news cluster.

In the example we show the condensed summary (100 words) for two users. User A is from the NLP domain and User B is from the network security domain.

The italic text in a user-specific summary shows the difference compared to the generic summary.


Generic summary:

The New Lab, Called Microsoft Research India, Goes Online In January, And Will Be Part Of A Network Of Five Research Labs That Microsoft Runs Worldwide, Said Padmanabhan Anandan, Managing Director Of Microsoft Research India. Microsoft’s Mission India, Formally Inaugurated Jan. 12, 2005, Is Microsoft’s Third Basic Research Facility Established Outside The United States . In Line With Microsoft’s Research Strategy Worldwide , The Bangalore Lab Will Collaborate With And Fund Research At Key Educational Institutions In India, Such As The Indian Institutes Of Technology, Anandan Said . Although Microsoft Research Doesn’t Engage In Product Development Itself, Technologies Researchers Create Can Make Their Way Into The Products The Company


User A Specific summary :

The New Lab, Called Microsoft Research India, Goes Online In January, And Will Be Part Of A Network Of Five Research Labs That Microsoft Runs Worldwide, Said Padmanabhan Anandan, Managing Director Of Microsoft Research India.Microsoft’s Mission India, Formally Inaugurated Jan. 12, 2005, Is Microsoft’s Third Basic Research Facility Established Outside The United States. Microsoft Will Collaborate With The Government Of India And The Indian Scientific Community To Conduct Research In Indic Language Computing Technologies, This Will Include Areas Such As Machine Translation Between Indian Languages And English, Search And Browsing And Character Recognition. In Line With Microsoft’s Research Strategy Worldwide,The Bangalore Lab


User B Specific summary :

The New Lab, Called Microsoft Research India, Goes Online In January, And Will Be Part Of A Network Of Five Research Labs That Microsoft Runs Worldwide, Said Padmanabhan Anandan , Managing Director Of Microsoft Research India. The Newly Announced India Research Group Focuses On Cryptography, Security, Algorithms And Multimedia Security, Ramarathnam Venkatesan, A Leading Cryptographer At Microsoft Research In Redmond, Washington, In The US, Will Head The New Group. Microsoft Research India will conduct a four-week summer school featuring lectures by leading experts in the fields of cryptography, algorithms and security. The program is aimed at senior undergraduate students, graduate students and faculty


Evaluation

The evaluation of this technique was carried out with five research scholars working in different fields of computer science.

News articles from the science and technology domain were considered for summarization. 25 different topics were chosen, with each topic having 5-10 articles.

Each researcher was asked to judge the relevance of both versions of the summaries for all 25 topics (on a 1-5 score).

Results show that the users prefer profile-based personalized summaries to a generic summary given by a general automatic summarization system.



Evaluation

This figure shows the scores given by a particular user across different topics.

For most of the topics, the user finds the personalized summaries relevant to him.

Personalized summaries for topics strongly related to the user’s domain are more relevant to him.

For topics which are not closely related to the user’s field, the personalized and generic summaries are quite similar.

For a few rare topics, the user did not find the personalized summary better.



ARTICLE2:

User-oriented Document Summarization Through Vision-based Eye-tracking

2009 ACM Songhua Xu , Hao Jiang , Francis C.M. Lau


MAIN IDEA

The key idea is to rely on the attention (reading) time that individual users spend on single words in a document.

The prediction of user attention over every word in a document is based on the user’s attention during his previous reads.

The algorithm tracks a user’s attention times over individual words using a vision-based commodity eye-tracking mechanism.

User attention time over any arbitrary word is predicted by a data mining process.

A simple web camera and an existing eye-tracking algorithm (the "Opengazer project") are used.

The error of the detected gaze location on the screen is between 1–2 cm, depending on which area of the screen the user is looking at (on a 19-inch screen monitor).



Anchoring Gaze Samples onto Individual Words

The detected gaze central point is positioned at (x, y) in screen space.

Compute the central displaying point of each word, denoted as (xi, yi).

The average width and height of a word’s displaying bounding box in the document are also computed.

For each gaze detected by the eye-tracking module, the gaze sample is assigned to the words in the document in this manner.

The overall attention that a word in the document receives is the sum of all the fractional gaze samples it is assigned in the above process.

During processing, stop words are removed.
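One possible way to realize this fractional assignment is sketched below. It assumes a Gaussian falloff of the gaze weight with the distance between the gaze point and the word center, scaled by the average word width and height; the falloff shape, the function names, and the toy coordinates are assumptions for illustration and are not the paper's formula.

```python
import math

def gaze_weight(gaze, center, avg_w, avg_h):
    """Fraction of one gaze sample credited to a word.
    Assumed Gaussian falloff scaled by average word width and height."""
    dx = (gaze[0] - center[0]) / avg_w
    dy = (gaze[1] - center[1]) / avg_h
    return math.exp(-0.5 * (dx * dx + dy * dy))

def accumulate_attention(gazes, word_centers, avg_w, avg_h, stop_words=frozenset()):
    """Sum the fractional gaze samples each non-stop word receives."""
    attention = {}
    for word, center in word_centers:
        if word in stop_words:
            continue  # stop words are removed during processing
        total = sum(gaze_weight(g, center, avg_w, avg_h) for g in gazes)
        attention[word] = attention.get(word, 0.0) + total
    return attention

# Toy layout: word centers (xi, yi) in pixels and two gaze samples (x, y).
word_centers = [("summary", (100, 50)), ("the", (160, 50)), ("user", (220, 50))]
gazes = [(100, 50), (105, 52)]
attention = accumulate_attention(gazes, word_centers, avg_w=60, avg_h=20,
                                 stop_words={"the"})
```

Both gaze samples fall near the first word, so it accumulates far more attention than the word two positions to its right.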



PREDICTION OF USER ATTENTION OVER A SENTENCE

Attention time prediction for a word is based on the semantic similarity of two words.

Sim(wi, wj) denotes the semantic similarity between word wi and word wj, where Sim(wi, wj) ∈ [0, 1].

The similarity is computed with the algorithm proposed in: Y. Li, Z. A. Bandar, and D. Mclean. An approach for measuring semantic similarity between words using multiple information sources. IEEE Transactions on Knowledge and Data Engineering.

For an arbitrary word w which is not among the sampled words w1, …, wn, calculate the similarity between w and every wi (i = 1, …, n) and then select the k words which share the highest semantic similarity with w (k is set as min(10, n)).
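The neighbor selection can be sketched as follows. The sketch assumes the prediction is a similarity-weighted average of the attention times of the k nearest sampled words; that aggregation rule is an assumption, and the sim callable is a hypothetical stand-in for the Li et al. measure, backed here by a toy lookup table.

```python
def predict_attention(word, sampled, sim, max_k=10):
    """sampled: dict mapping a word to its observed attention time.
    sim: callable (w1, w2) -> similarity in [0, 1] (hypothetical stand-in)."""
    k = min(max_k, len(sampled))
    nearest = sorted(sampled, key=lambda w: sim(word, w), reverse=True)[:k]
    weight = sum(sim(word, w) for w in nearest)
    if weight == 0:
        return 0.0  # no similar word has been observed
    # Assumed aggregation: similarity-weighted average of neighbor attention.
    return sum(sim(word, w) * sampled[w] for w in nearest) / weight

# Toy similarity table (hypothetical values).
toy_sim = {("encryption", "security"): 0.9,
           ("encryption", "cipher"): 0.6,
           ("encryption", "travel"): 0.0}

def sim(a, b):
    return toy_sim.get((a, b), 0.0)

sampled = {"security": 2.0, "cipher": 1.0, "travel": 5.0}
predicted = predict_attention("encryption", sampled, sim)
```

The unrelated word has similarity 0, so it does not influence the prediction even though it belongs to the k selected neighbors.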



Predicting User Attention for Sentences

Estimate the total attention of a certain user on a sentence as the sum of the user’s attention over all the words in the sentence:

AT(wi, Uj) is user Uj’s attention over the word wi, which is either sampled from the user’s previous reading activities via (1) or predicted via (2).

The coefficient applied to word wi is 0 if wi is a stop word, 0.6 if there is no attention sample for the user Uj over the word wi, and 1 otherwise.
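The sum can be sketched as below, assuming the three-valued coefficient (0, 0.6, 1) multiplies the per-word attention before summation. The function name, the argument layout, and the toy values are hypothetical.

```python
def sentence_attention(words, sampled, predicted, stop_words):
    """Sum of coefficient * AT(wi, Uj) over the words of one sentence.
    sampled: word -> attention observed in previous reads.
    predicted: word -> attention predicted from similar words."""
    total = 0.0
    for w in words:
        if w in stop_words:
            coeff, at = 0.0, 0.0          # stop words contribute nothing
        elif w in sampled:
            coeff, at = 1.0, sampled[w]   # directly observed attention
        else:
            coeff, at = 0.6, predicted.get(w, 0.0)  # discounted prediction
        total += coeff * at
    return total

score = sentence_attention(
    ["the", "security", "encryption"],
    sampled={"security": 2.0},
    predicted={"encryption": 1.5},
    stop_words={"the"},
)
```

Here the observed word contributes 2.0 and the predicted word contributes 0.6 * 1.5, so the sentence receives roughly 2.9.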





A Hybrid Summarization Approach

In early experiments, the authors noticed that the performance of the user-oriented document summarization algorithm heavily depends on the amount of available user attention time samples.

To address the issue, the new method is integrated with a conventional automatic document summarization algorithm (MEAD).

The indicator for sentence si is 1 if si is selected by MEAD in its document summarization result, and 0 otherwise.

k is a free parameter and is user tunable.



EXPERIMENT RESULTS

The document summarization results are compared with those generated by two popular text summarization algorithms.

Two sets of articles are used. Articles in the first set are all about science (60 articles from "Science" magazine) and articles in the second set are all about entertainment and leisure (60 articles randomly selected from the travel and sports sections of the "New York Times").

12 people with different knowledge backgrounds read some selected articles from the two article sets.

They are then asked to provide a summary of the article they just read.



EXPERIMENT RESULTS

To measure the performance, three measurements are introduced: Recall (R), Precision (P) and F-rate (F).

SU e is the human summary result
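The three measurements can be computed as in the sketch below. It assumes the standard set-overlap definitions applied at sentence level (shared sentences divided by the system summary size for P, by the human summary size for R, and their harmonic mean for F); the granularity and the names are assumptions.

```python
def precision_recall_f(system_sentences, human_sentences):
    """Standard set-overlap precision, recall and F-rate (assumed definitions)."""
    system, human = set(system_sentences), set(human_sentences)
    common = len(system & human)
    precision = common / len(system) if system else 0.0
    recall = common / len(human) if human else 0.0
    f_rate = 2 * precision * recall / (precision + recall) if common else 0.0
    return precision, recall, f_rate

# Sentence ids chosen by the system and by the human reader (toy data).
p, r, f = precision_recall_f([1, 2, 3, 4], [2, 3, 5])
```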






EXPERIMENT RESULTS

An experiment evaluates the performance of the hybrid approach under different settings for the parameter k.

Article3: Web-Page Summarization Using Clickthrough Data, 2005 ACM, Jian-Tao Sun, Dou Shen, Hua-Jun Zeng, Qiang Yang, Yuchang Lu, Zheng Chen

Main Idea

Use the extra knowledge of clickthrough data to improve Web-page summarization.

A collection of clickthrough data can be represented by a set of triples <u, q, p>.

Typically, a user's query words reflect the true meaning of the target Web-page content.
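To make the triple representation concrete, the sketch below groups query words by page. It assumes u is a user id, q the query string and p the clicked page, and that queries can be split on whitespace; the names and toy records are hypothetical.

```python
from collections import Counter, defaultdict

def query_terms_by_page(triples):
    """Collect the query words associated with each page from <u, q, p> triples."""
    terms = defaultdict(Counter)
    for _user, query, page in triples:
        terms[page].update(query.lower().split())
    return terms

triples = [
    ("u1", "python tutorial", "p1"),
    ("u2", "Python docs", "p1"),
    ("u1", "cheap flights", "p2"),
]
terms = query_terms_by_page(triples)
```

The per-page counters are the kind of query-word evidence that the adapted summarizers can combine with the page's own word frequencies.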

In the new algorithm, two text-summarization methods are adapted to summarize Web pages.

The first approach is based on significant-word selection, adapted from Luhn's method.

The second method is based on Latent Semantic Analysis (LSA).


Problems

Web pages may have no associated query words.

The clickthrough data are often very noisy.

Solution

Thematic lexicon: built using an annotated hierarchical taxonomy of Web pages, such as the one provided by the ODP web-site (http://dmoz.org/).


Adapted Significant Word (ASW) Method

Each sentence is assigned a significance factor (based on word frequency), and the sentences with high significance factors are selected to form the summary.

customized factor :

Adapted Latent Semantic Analysis (ALSA) Method

The corpus can be represented by a term-document matrix.


Summarize Web Pages Not Covered by Clickthrough Data

Build a thematic lexicon.

TS(c) represents the set of terms associated with category c.

The thematic lexicon is a set of TS, which correspond to categories in ODP.

The lexicon is built as follows:

First, the TS corresponding to each category is set empty.

For each page covered by the clickthrough data, its query words are added into the TS of its category.

If a page belongs to more than one category, its query terms are added into the TS of all its categories.

Finally, the term weight in each TS is multiplied by its Inverse Category Frequency (ICF).

For each Web page that is not covered by the clickthrough data, first look up the lexicon for the TS according to the page's category; then the summarization methods are applied.

When a TS does not have sufficient terms, the TS corresponding to its parent category is used.
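The construction steps can be sketched as follows. The sketch assumes the term weight is the raw query-word count and that ICF is defined by analogy with IDF, as log(number of categories / number of categories whose TS contains the term); both choices, and all names and toy data, are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

def build_lexicon(page_queries, page_categories):
    """page_queries: page -> list of query words from the clickthrough data.
    page_categories: page -> list of ODP categories the page belongs to."""
    lexicon = defaultdict(Counter)  # every TS starts empty
    for page, words in page_queries.items():
        for cat in page_categories.get(page, []):
            lexicon[cat].update(words)  # add to the TS of all its categories
    n_cats = len(lexicon)
    cat_freq = Counter(t for ts in lexicon.values() for t in ts)
    # Assumed ICF: log(#categories / #categories whose TS contains the term).
    return {cat: {t: tf * math.log(n_cats / cat_freq[t]) for t, tf in ts.items()}
            for cat, ts in lexicon.items()}

page_queries = {"p1": ["python", "tutorial"], "p2": ["python", "flights"]}
page_categories = {"p1": ["Computers"], "p2": ["Computers", "Travel"]}
lexicon = build_lexicon(page_queries, page_categories)
```

In the toy data only "tutorial" is specific to one category, so it is the only term that keeps a non-zero weight after the ICF multiplication.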


EXPERIMENTS

The data set contains about 44.7 million records from 29 days, Dec 6 of 2003 to Jan 3 of 2004 (MSN search engine).

3,074,678 Web pages of the ODP directory are crawled.

In the end, 1,125,207 Web pages are obtained, 260,763 of which are clicked by Web users using 1,586,472 different queries.

DAT1 consists of 90 pages selected from the browsed pages.

Three human evaluators were employed to summarize these pages.

A relatively large-scale data set, denoted by DAT2, is also used to evaluate the summarization methods (10,000 pages).


Summarization Results on DAT1 (ASW)

ROUGE is a software package adopted by DUC for automatic summarization evaluation (http://www.isi.edu/~cyl/ROUGE/)


Summarization Results on DAT1 (ALSA)


Evaluation of the summarization method using the thematic lexicon

The clickthrough data contains only 260,763 pages, and the lexicon contains 141,869 categories, which is a subset of the ODP category structure.

If the terms under a page's category have more than P% overlap with the distinct terms in the Web page, then they are used for summarization. Otherwise, the lexicon terms of its parent category are used.

This process continues until a category which covers enough query terms is found, or until the root of the thematic lexicon is reached.
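The walk toward the root can be sketched as below. It assumes the overlap is measured as the share of the page's distinct terms that also appear in the category's term set, which is one possible reading of the P% rule; the default threshold, the parent map, and the toy categories are hypothetical.

```python
def lexicon_terms_for_page(page_terms, category, lexicon, parent, p_percent=30.0):
    """Walk from the page's category toward the root until the overlap test passes.
    parent: category -> parent category (the root maps to None).
    p_percent is a hypothetical threshold."""
    distinct = set(page_terms)
    current = category
    while current is not None:
        terms = set(lexicon.get(current, ()))
        if distinct:
            # Assumed overlap: share of the page's distinct terms found in TS.
            overlap = 100.0 * len(terms & distinct) / len(distinct)
            if overlap > p_percent:
                return current, terms
        current = parent.get(current)
    return None, set()  # the root was reached without enough overlap

lexicon = {
    "Top/Computers/Security": {"cipher"},
    "Top/Computers": {"software", "network", "code"},
    "Top": {"news"},
}
parent = {"Top/Computers/Security": "Top/Computers",
          "Top/Computers": "Top", "Top": None}
chosen, chosen_terms = lexicon_terms_for_page(
    ["network", "code", "router", "network"], "Top/Computers/Security",
    lexicon, parent)
```

The page's own category shares no terms with the page, so the lookup falls back to its parent, where two of the three distinct page terms are covered.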


Evaluation of the summarization method using the thematic lexicon (ASW)


Evaluation of the summarization method using the thematic lexicon (ALSA)

thanks