Extraction-Based Automatic Summarization
Abdelaziz Al-Rihawi
Mohammad Kher Kabbaby
Faculty of Information Technology Engineering
Damascus University
Artificial Intelligence Department
Classification of Summarization Tasks
Summary Type
Extraction-based
• Extracts objects from the entire collection, without modifying the objects themselves.
• The goal is to select whole sentences (without modifying them).

Abstraction-based
• Rephrases the selected content to produce the final summary.
Use of External resources
Knowledge-Poor
• Do not use any external resources to generate the summary.

Knowledge-Rich
• May utilize an external corpus such as Wikipedia, or lexical resources such as WordNet or VerbOcean.
• These resources are used to unravel semantic relations between words, phrases or sentences.
Task-Specific Constraints
Query-Focused
• A query is provided to the summarizer in addition to the source documents.
• The summarizer constructs a summary that contains the information requested by the query.

Update
• The purpose of an update summary is to identify new pieces of information in the more recent articles, under the assumption that the user has already read the previous ones.

Guided
• A set of aspects that should be covered in the summary is provided.
Summarization Workflow
Preprocessing
Sentence Representation
Similarity Measures
Content Selection
Preprocessing
Sentence Segmentation
Short Sentence Removal
Word Segmentation
Stop Word Removal
Short Word Removal
Stemming
Sentence Segmentation
Splitting a text into sentences using Punkt, an unsupervised sentence boundary identification algorithm.
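The real Punkt algorithm (available in NLTK as nltk.tokenize.punkt) learns abbreviation and boundary statistics from raw text. As an illustrative stand-in only, a naive rule-based splitter in Python could look like this:

```python
import re

def split_sentences(text):
    # Naive sketch, NOT the Punkt algorithm: split on ., ! or ?
    # followed by whitespace and an uppercase letter. Punkt instead
    # learns which periods end sentences (e.g. skipping abbreviations).
    parts = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text.strip())
    return [p for p in parts if p]
```

Such a rule breaks on abbreviations like "Dr. Smith", which is exactly the case Punkt's unsupervised training handles.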
Short Sentence Removal
Sentences containing fewer than three words are considered not informative enough or incorrectly segmented.
Word Segmentation
Splitting a sentence into words based on spaces and punctuation, using the conventions of the Penn Treebank to handle special cases.
Example: “weren’t” is split into two tokens, “were” and “n’t”.
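A minimal tokenizer sketch that reproduces this contraction rule (the function name and regular expressions are illustrative, not the Penn Treebank reference implementation):

```python
import re

def tokenize(sentence):
    # Penn Treebank-style handling of "n't": "weren't" -> "were", "n't".
    sentence = re.sub(r"(\w)n't\b", r"\1 n't", sentence)
    # Then split the rest on whitespace and punctuation.
    return re.findall(r"n't|\w+|[^\w\s]", sentence)
```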
Stop Word Removal
Stop words like: { and, the, or, … }
A predicate function (such as is_a) is defined to test whether a token is a stop word so that it can be removed.
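A sketch of such a predicate and the filtering step (the stop word list here is a tiny illustrative sample, not a complete one):

```python
# Tiny illustrative stop word list; real systems use lists of ~100+ words.
STOP_WORDS = {"and", "the", "or", "a", "an", "of", "in", "to", "is"}

def is_stop_word(word):
    # Predicate: does this token carry little content?
    return word.lower() in STOP_WORDS

def remove_stop_words(tokens):
    return [t for t in tokens if not is_stop_word(t)]
```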
Short Word Removal
Words shorter than three letters are considered to be non-content words.
Stemming
A rule-based stemming algorithm, the Porter stemmer.
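The full Porter stemmer applies five ordered phases of suffix-stripping rules guarded by "measure" conditions (NLTK ships a complete implementation as nltk.stem.PorterStemmer). The toy sketch below only strips a few common suffixes to illustrate the idea:

```python
def stem(word):
    # Toy suffix-stripping sketch in the spirit of the Porter stemmer;
    # the real algorithm has many more rules and conditions.
    for suffix, repl in (("ies", "i"), ("sses", "ss"), ("ing", ""), ("ed", ""), ("s", "")):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + repl
    return word
```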
Output of Preprocessing
Example:
“Every weekend, students in Zhengzhou can take in a free concert, a traditional Chinese opera or a stage play in the city’s Youth and Children’s Palace.”
Output of Preprocessing
Words:
{ weekend, student, zhengzhou, take, free, concert, tradit, chines, opera, stage, play, citi, youth, children, palace }
Sentence Representation
Feature Selection
• Term as Feature
• Features are obtained by selecting unique words from the preprocessed sentences.
• Each sentence is represented as a context vector
• The vectors are accumulated into a representation matrix
Feature Selection
Term Count (TC)
Represents sentences as vectors whose elements are the absolute frequencies of words in the sentence.
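A small sketch of building TC vectors over a fixed vocabulary of unique preprocessed words:

```python
from collections import Counter

def term_count_matrix(sentences, vocab):
    # Each row is one sentence; each column holds the absolute
    # frequency of the corresponding vocabulary word in that sentence.
    rows = []
    for words in sentences:
        counts = Counter(words)
        rows.append([counts[w] for w in vocab])
    return rows
```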
Feature Selection
Term frequency-inverse sentence frequency (TF-ISF)
The same weighting scheme as TF-IDF, but sentences are used instead of documents.
TF(w, d) = TC(w, d) / |d|

ISF(w) = log( |D| / (|D(w)| + 1) )

w : a word
|d| : the number of words in sentence d
|D| : the number of sentences in the collection
|D(w)| : the number of sentences containing w
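The formulas above translate directly to code (natural logarithm assumed, since the slide does not specify a base):

```python
import math

def tf(word, sentence):
    # Term frequency: count of the word divided by sentence length.
    return sentence.count(word) / len(sentence)

def isf(word, sentences):
    # Inverse sentence frequency: log of total sentences over the
    # number of sentences containing the word (plus one for smoothing).
    containing = sum(1 for s in sentences if word in s)
    return math.log(len(sentences) / (containing + 1))

def tf_isf(word, sentence, sentences):
    return tf(word, sentence) * isf(word, sentences)
```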
Feature Selection
Latent semantic analysis (LSA):
A distributed representation model. The matrix V_k, or the matrix product S_k · V_k, obtained from a truncated singular value decomposition (SVD) is used as the sentence representation matrix.
Feature Selection
LSA Algorithm:
Step 1 - Create the count matrix
Step 2 - Modify the counts with TF-IDF
Step 3 - Apply the singular value decomposition
Step 4 - Select sentences for the summary
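Step 3 can be sketched with NumPy's SVD. Here the input matrix is assumed to be sentences x terms (as in the count matrix on the next slide), so the rows of U_k · S_k play the role that S_k · V_k plays for the transposed, terms x sentences orientation:

```python
import numpy as np

def lsa_representation(matrix, k):
    # Truncated SVD of a (sentences x terms) weight matrix: A = U S V^T.
    # Keeping the top-k singular values, each row of U_k * S_k represents
    # a sentence in the k-dimensional latent semantic space.
    U, S, Vt = np.linalg.svd(np.asarray(matrix, dtype=float), full_matrices=False)
    return U[:, :k] * S[:k]
```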
Feature Selection
       art  concert  capit  citi  children  educ
S1      1      0       0      0      1       1
S2      1      1       1      0      0       0
S3      0      1       0      1      0       0
S4      0      0       0      1      0       1

Count Matrix
Feature Selection
       art   concert  capit  citi  children  educ
S1    0.23     0        0     0      0.46    0.23
S2    0.23    0.23     0.46   0       0       0
S3     0      0.35      0    0.35     0       0
S4     0       0        0    0.35     0      0.35

TF-IDF Representation Matrix
Feature Selection
       D1     D2     D3
S1    0.23    0      0
S2    0.23   0.23   0.46
S3     0     0.35    0
S4     0      0      0

LSA Representation Matrix
Similarity Measures
Similarity Measures
• Corpus-based
Measures use term frequencies observed in a corpus to relate contexts to each other.

• Knowledge-based
Measures use predefined semantic relations between terms obtained from lexical resources.
Similarity Measures
Jaccard similarity coefficient
A set-based similarity metric used for measuring similarity between sentences under the TC representation.
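A direct implementation of the coefficient |A ∩ B| / |A ∪ B| over sentences given as sets of terms:

```python
def jaccard(a, b):
    # Jaccard coefficient: size of the intersection over size of the union.
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty sentences are conventionally identical
    return len(a & b) / len(a | b)
```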
Similarity Measures
Cosine similarity
A vector-based similarity metric used for representations with real-valued weights such as TF-ISF and LSA.
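A minimal implementation over plain Python lists:

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of norms.
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    if nu == 0 or nv == 0:
        return 0.0  # convention for zero vectors
    return dot / (nu * nv)
```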
Sentence Selection
Sentence Selection
The goal of the selection procedure is to identify a set of sentences that contain important information.
Three criteria are optimized when selecting the sentences:
1. Relevance
2. Redundancy
3. Length
Maximize the relevance while minimizing the redundancy
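One common way to realize this trade-off is a greedy, MMR-style selection loop. The slides do not name a specific algorithm, so the scoring rule and the lam parameter below are illustrative assumptions:

```python
def greedy_select(relevance, sim, max_sentences, lam=0.5):
    # Greedy sketch of the relevance/redundancy trade-off: repeatedly
    # pick the sentence with the highest relevance penalized by its
    # similarity to sentences already selected.
    # relevance: per-sentence scores; sim: pairwise similarity matrix.
    selected = []
    candidates = set(range(len(relevance)))
    while candidates and len(selected) < max_sentences:
        def score(i):
            redundancy = max((sim[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```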
Sentence Selection
Selection of sentences can be handled with either:
1. Supervised Methods
2. Unsupervised Methods
Sentence Selection
Supervised Methods:
• Use a classifier trained on a set of documents coupled with corresponding extracts.
• It is possible to label sentences with a binary value:
• 1 - the sentence is included in the extract
• 0 - the sentence is not included in the extract
• Each sentence is represented by a feature vector.
• A classifier is trained on the set of feature vectors.
Sentence Selection
Supervised Methods:
1. Cue phrases and topic terms
2. Position of a sentence in a document
3. Centrality of a sentence
• for example, the similarity between a sentence and the other sentences
4. Length of a sentence
• for example, the number of open-class words (i.e. nouns, main verbs, adjectives, adverbs) in the sentence
Sentence Selection
Unsupervised Methods:
• Unsupervised summarization algorithms are either centroid-based or centrality-based.
Sentence Selection
Centroid-based Algorithm:
• Select sentences that contain informative words
• These words are referred to as topic signatures
• Informativeness is calculated using popular weighting schemes such as TF-IDF or the log-likelihood ratio
Sentence Selection
Centrality-based Algorithm:

• Assumes the similarity between each pair of sentences is available.
• The centrality of a sentence S is the average of the similarities between S and all the other sentences.
• The algorithms described above rely on superficial features, ignoring higher-level semantic information such as semantic relations between terms; this motivates the use of graph theory.
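The averaged-similarity centrality just described can be computed directly from a pairwise similarity matrix:

```python
def centrality_scores(sim):
    # Centrality of each sentence: the average of its similarities to
    # all the other sentences (self-similarity on the diagonal excluded).
    n = len(sim)
    return [sum(sim[i][j] for j in range(n) if j != i) / (n - 1) for i in range(n)]
```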
Use of Graphs in Automatic Summarization
Graph Representations
• Similarity graph
• Event graph
Graph Representations
• Similarity relations.
• Semantic relations such as semantic roles, cause-consequences, specifications, time relations.
Centrality Measures
• Graph theory and network analysis provide a great number of methods and algorithms for working with graphs.
• The length of an edge in the graph corresponds to the actual distance between its nodes.
• The size of a node is scaled according to the centrality of that node, calculated using a specific centrality measure, so that more central nodes appear larger.
Centrality Measures
• Degree-based Methods
• Path-based Methods
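As a simple example of a degree-based measure, degree centrality counts a node's neighbours and normalizes by the maximum possible degree (the graph encoding below is an illustrative choice):

```python
def degree_centrality(adj):
    # Degree centrality: number of neighbours of each node, divided by
    # n - 1 (the maximum possible degree in an n-node simple graph).
    # adj maps each node to the set of nodes it is connected to.
    n = len(adj)
    return {node: len(neigh) / (n - 1) for node, neigh in adj.items()}
```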
Reference
Extraction-Based Automatic Summarization
Gleb Sizov
Master of Science in Computer Science thesis, June 2010

Thanks!!