Extraction-Based Automatic Summarization
Abdelaziz Al-Rihawi
Mohammad Kher Kabbaby
Faculty of Information Technology Engineering
Damascus University
Artificial Intelligence Department
Classification of Summarization Tasks
Summary Type
Extraction-based
• Extracts objects from the entire collection, without modifying the objects themselves.
• The goal is to select whole sentences (without modifying them).

Abstraction-based
• Rephrases the selected content to produce the final summary.
Use of External resources
Knowledge-Poor
• Do not use any external resources to generate the summary.

Knowledge-Rich
• May utilize an external corpus such as Wikipedia, or lexical resources such as WordNet or VerbOcean.
• These resources are used to unravel semantic relations between words, phrases or sentences.
Task-Specific Constraints
Query-Focused
• A query is provided to the summarizer in addition to the source documents.
• The summarizer constructs a summary that contains the information requested by the query.

Update
• The purpose of an update summary is to identify new pieces of information in the more recent articles, under the assumption that the user has already read the previous ones.

Guided
• A set of aspects that should be covered in the summary is provided.
Summarization Workflow
Preprocessing
Sentence Representation
Similarity Measures
Content Selection
Preprocessing
Sentence Segmentation
Short Sentence Removal
Word Segmentation
Stop Word Removal
Short Word Removal
Stemming
Sentence Segmentation
Splitting a text into sentences using Punkt, an unsupervised sentence boundary identification algorithm.
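The real Punkt algorithm (available in NLTK as nltk.tokenize.punkt) learns abbreviation and boundary statistics from raw text. As an illustrative stand-in only, a naive rule-based splitter in Python could look like this:

```python
import re

def split_sentences(text):
    # Naive sketch, NOT the Punkt algorithm: split on ., ! or ?
    # followed by whitespace and an uppercase letter. Punkt instead
    # learns which periods end sentences (e.g. skipping abbreviations).
    parts = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text.strip())
    return [p for p in parts if p]
```

Such a rule breaks on abbreviations like "Dr. Smith", which is exactly the case Punkt's unsupervised training handles.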
Short Sentence Removal
Sentences containing fewer than three words are considered not informative enough or incorrectly segmented.
Word Segmentation
Splitting a sentence into words based on spaces and punctuation, using the conventions of the Penn Treebank to handle special cases.
Example: “weren’t” is split into two tokens, “were” and “n’t”.
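A minimal tokenizer sketch that reproduces this contraction rule (the function name and regular expressions are illustrative, not the Penn Treebank reference implementation):

```python
import re

def tokenize(sentence):
    # Penn Treebank-style handling of "n't": "weren't" -> "were", "n't".
    sentence = re.sub(r"(\w)n't\b", r"\1 n't", sentence)
    # Then split the rest on whitespace and punctuation.
    return re.findall(r"n't|\w+|[^\w\s]", sentence)
```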
Stop Word Removal
Stop words like: { and, the, or, … }
A predicate function (such as is_a) is defined to test whether a token is a stop word so that it can be removed.
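A sketch of such a predicate and the filtering step (the stop word list here is a tiny illustrative sample, not a complete one):

```python
# Tiny illustrative stop word list; real systems use lists of ~100+ words.
STOP_WORDS = {"and", "the", "or", "a", "an", "of", "in", "to", "is"}

def is_stop_word(word):
    # Predicate: does this token carry little content?
    return word.lower() in STOP_WORDS

def remove_stop_words(tokens):
    return [t for t in tokens if not is_stop_word(t)]
```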
Short Word Removal
Words shorter than three letters are considered to be non-content words.
Stemming
A rule-based stemming algorithm, the Porter stemmer.
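The full Porter stemmer applies five ordered phases of suffix-stripping rules guarded by "measure" conditions (NLTK ships a complete implementation as nltk.stem.PorterStemmer). The toy sketch below only strips a few common suffixes to illustrate the idea:

```python
def stem(word):
    # Toy suffix-stripping sketch in the spirit of the Porter stemmer;
    # the real algorithm has many more rules and conditions.
    for suffix, repl in (("ies", "i"), ("sses", "ss"), ("ing", ""), ("ed", ""), ("s", "")):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + repl
    return word
```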
Output of Preprocessing
Example:
“Every weekend, students in Zhengzhou can take in a free concert, a traditional Chinese opera or a stage play in the city’s Youth and Children’s Palace.”
Output of Preprocessing
Words:
{ weekend, student, zhengzhou, take, free, concert, tradit, chines, opera, stage, play, citi, youth, children, palace }
Sentence Representation
Feature Selection
• Term as Feature
• Features are obtained by selecting unique words from the preprocessed sentences.
• Each sentence is represented as a context vector
• The vectors are accumulated into a representation matrix
Feature Selection
Term Count (TC)
Represents sentences as vectors whose elements are the absolute frequencies of words in the sentence.
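A small sketch of building TC vectors over a fixed vocabulary of unique preprocessed words:

```python
from collections import Counter

def term_count_matrix(sentences, vocab):
    # Each row is one sentence; each column holds the absolute
    # frequency of the corresponding vocabulary word in that sentence.
    rows = []
    for words in sentences:
        counts = Counter(words)
        rows.append([counts[w] for w in vocab])
    return rows
```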
Feature Selection
Term frequency-inverse sentence frequency (TF-ISF)
The same weighting scheme as TF-IDF, but sentences are used instead of documents.
TF(w, d) = TC(w, d) / |d|

ISF(w) = log( |D| / (|D(w)| + 1) )

w : a word
|d| : the number of words in sentence d
|D| : the number of sentences in the collection
|D(w)| : the number of sentences containing w
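The formulas above translate directly to code (natural logarithm assumed, since the slide does not specify a base):

```python
import math

def tf(word, sentence):
    # Term frequency: count of the word divided by sentence length.
    return sentence.count(word) / len(sentence)

def isf(word, sentences):
    # Inverse sentence frequency: log of total sentences over the
    # number of sentences containing the word (plus one for smoothing).
    containing = sum(1 for s in sentences if word in s)
    return math.log(len(sentences) / (containing + 1))

def tf_isf(word, sentence, sentences):
    return tf(word, sentence) * isf(word, sentences)
```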
Feature Selection
Latent semantic analysis (LSA):
A distributed representation model. The matrix V_k, or the matrix product S_k · V_k, obtained from a truncated singular value decomposition (SVD) is used as the sentence representation matrix.
Feature Selection
LSA Algorithm:
Step 1 - Create the count matrix
Step 2 - Modify the counts with TF-IDF
Step 3 - Apply the singular value decomposition
Step 4 - Select sentences for the summary
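Step 3 can be sketched with NumPy's SVD. Here the input matrix is assumed to be sentences x terms (as in the count matrix on the next slide), so the rows of U_k · S_k play the role that S_k · V_k plays for the transposed, terms x sentences orientation:

```python
import numpy as np

def lsa_representation(matrix, k):
    # Truncated SVD of a (sentences x terms) weight matrix: A = U S V^T.
    # Keeping the top-k singular values, each row of U_k * S_k represents
    # a sentence in the k-dimensional latent semantic space.
    U, S, Vt = np.linalg.svd(np.asarray(matrix, dtype=float), full_matrices=False)
    return U[:, :k] * S[:k]
```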
Feature Selection
       art  concert  capit  citi  children  educ
S1      1      0       0      0      1       1
S2      1      1       1      0      0       0
S3      0      1       0      1      0       0
S4      0      0       0      1      0       1

Count Matrix
Feature Selection
       art   concert  capit  citi  children  educ
S1    0.23     0        0     0      0.46    0.23
S2    0.23    0.23     0.46   0       0       0
S3     0      0.35      0    0.35     0       0
S4     0       0        0    0.35     0      0.35

TF-IDF Representation Matrix
Feature Selection
       D1     D2     D3
S1    0.23    0      0
S2    0.23   0.23   0.46
S3     0     0.35    0
S4     0      0      0

LSA Representation Matrix
Similarity Measures
Similarity Measures
• Corpus-based
Measures use term frequencies observed in a corpus to relate contexts to each other.

• Knowledge-based
Measures use predefined semantic relations between terms obtained from lexical resources.
Similarity Measures
Jaccard similarity coefficient
A set-based similarity metric used for measuring similarity between sentences under the TC representation.
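A direct implementation of the coefficient |A ∩ B| / |A ∪ B| over sentences given as sets of terms:

```python
def jaccard(a, b):
    # Jaccard coefficient: size of the intersection over size of the union.
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty sentences are conventionally identical
    return len(a & b) / len(a | b)
```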
Similarity Measures
Cosine similarity
A vector-based similarity metric used for representations with real-valued weights such as TF-ISF and LSA.
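A minimal implementation over plain Python lists:

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of norms.
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    if nu == 0 or nv == 0:
        return 0.0  # convention for zero vectors
    return dot / (nu * nv)
```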
Sentence Selection
Sentence Selection
The goal of the selection procedure is to identify a set of sentences that contain important information.
Three criteria are optimized when selecting the sentences:
1. Relevance
2. Redundancy
3. Length
Maximize the relevance while minimizing the redundancy
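One common way to realize this trade-off is a greedy, MMR-style selection loop. The slides do not name a specific algorithm, so the scoring rule and the lam parameter below are illustrative assumptions:

```python
def greedy_select(relevance, sim, max_sentences, lam=0.5):
    # Greedy sketch of the relevance/redundancy trade-off: repeatedly
    # pick the sentence with the highest relevance penalized by its
    # similarity to sentences already selected.
    # relevance: per-sentence scores; sim: pairwise similarity matrix.
    selected = []
    candidates = set(range(len(relevance)))
    while candidates and len(selected) < max_sentences:
        def score(i):
            redundancy = max((sim[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```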
Sentence Selection
Selection of sentences can be handled with either:
1. Supervised Methods
2. Unsupervised Methods
Sentence Selection
Supervised Methods:
• Use a classifier trained on a set of documents coupled with corresponding extracts.
• It is possible to label sentences with a binary value:
• 1 - the sentence is included in the extract
• 0 - the sentence is not included in the extract
• Each sentence is represented by a feature vector.
• A classifier is trained on the set of feature vectors.
Sentence Selection
Supervised Methods:
1. Cue phrases and topic terms
2. Position of a sentence in a document
3. Centrality of a sentence
• for example, the similarity between a sentence and the other sentences
4. Length of a sentence
• for example, the number of open-class words (i.e. nouns, main verbs, adjectives, adverbs) in the sentence
Sentence Selection
Unsupervised Methods:
• Unsupervised summarization algorithms are either centroid-based or centrality-based.
Sentence Selection
Centroid-based Algorithm:
• Select sentences that contain informative words
• These words are referred to as topic signatures
• Informativeness is calculated using popular weighting schemes such as TF-IDF or the log-likelihood ratio
Sentence Selection
Centrality-based Algorithm:

• Assumes the similarity between each pair of sentences is available.
• The centrality of a sentence S is the average of the similarities between S and all the other sentences.
• The algorithms described above rely on superficial features, ignoring higher-level semantic information such as semantic relations between terms; this motivates the use of graph theory.
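The averaged-similarity centrality just described can be computed directly from a pairwise similarity matrix:

```python
def centrality_scores(sim):
    # Centrality of each sentence: the average of its similarities to
    # all the other sentences (self-similarity on the diagonal excluded).
    n = len(sim)
    return [sum(sim[i][j] for j in range(n) if j != i) / (n - 1) for i in range(n)]
```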
Use of Graphs in Automatic Summarization
Graph Representations
• Similarity graph
• Event graph
Graph Representations
• Similarity relations.
• Semantic relations such as semantic roles, cause-consequences, specifications, time relations.
Centrality Measures
• Graph theory and network analysis provide a great number of methods and algorithms for working with graphs.
• The length of an edge in the graph corresponds to the actual distance between its nodes.
• The size of a node is scaled according to the centrality of that node, calculated using a specific centrality measure, so that more central nodes appear larger.
Centrality Measures
• Degree-based Methods
• Path-based Methods
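As a simple example of a degree-based measure, degree centrality counts a node's neighbours and normalizes by the maximum possible degree (the graph encoding below is an illustrative choice):

```python
def degree_centrality(adj):
    # Degree centrality: number of neighbours of each node, divided by
    # n - 1 (the maximum possible degree in an n-node simple graph).
    # adj maps each node to the set of nodes it is connected to.
    n = len(adj)
    return {node: len(neigh) / (n - 1) for node, neigh in adj.items()}
```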
Reference
Extraction-Based Automatic Summarization
Gleb Sizov
Master of Science in Computer Science thesis, June 2010

Thanks!!