Concept based Multi-Document Text Summarization
By: Asef Poormasoomi
Supervisor: Dr. Kahani
Autumn 2010
Ferdowsi University of Mashhad
Introduction
summary: brief but accurate representation of the contents of a document
Is this the best we can do?
Motivation
• Abstracts for scientific and other articles
• News summarization (mostly multi-document summarization)
• Classification of articles and other written data
• Web pages for search engines
• Web access from PDAs and cell phones
• Question answering and data gathering
Genres
• Extract vs. abstract
  • lists fragments of text vs. re-phrases content coherently.
  • example: "He ate banana, orange and apple" => "He ate fruit"
• Generic vs. query-oriented
  • provides author's view vs. reflects user's interest.
  • example: question answering system
• Personal vs. general
  • considers reader's prior knowledge vs. general.
• Single-document vs. multi-document source
  • based on one text vs. fuses together many texts.
• Input
  • text, video, image, map
Methods
• Statistical scoring methods (pseudo)
• Higher semantic/syntactic structures
  • Network (graph) based methods
  • Semantic based methods (LSA, ontology, WordNet)
  • Other methods (rhetorical analysis, lexical chains, co-reference chains)
• AI methods
Statistical scoring (Pseudo)
General method:
1. score each entity (sentence, word);
2. combine scores;
3. choose best sentence(s)
Scoring techniques:
• Word frequencies throughout the text (Luhn 58)
• Position in the text (Edmundson 69, Lin & Hovy 97)
• Title method (Edmundson 69)
• Cue phrases in sentences (Edmundson 69)
• Bayesian classifier (Kupiec et al. 95)
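The three-step general method above can be sketched as follows. This is a minimal illustration, not the method of any cited paper: it combines only two of the listed signals (word frequency and sentence position), and the stop-word list, the weights `w_freq` and `w_pos`, and all function names are assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "it", "on"}

def split_sentences(text):
    # Naive split on sentence-final punctuation followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS]

def summarize(text, n=1, w_freq=1.0, w_pos=0.5):
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in tokenize(s))
    scored = []
    for i, s in enumerate(sentences):
        words = tokenize(s)
        # Step 1: score each sentence by average word frequency and by position.
        freq_score = sum(freq[w] for w in words) / len(words) if words else 0.0
        pos_score = 1.0 / (i + 1)  # earlier sentences score higher
        # Step 2: combine the scores with a weighted sum.
        scored.append((w_freq * freq_score + w_pos * pos_score, i))
    # Step 3: choose the best n sentences, then restore document order.
    best = sorted(scored, reverse=True)[:n]
    return [sentences[i] for _, i in sorted(best, key=lambda t: t[1])]

text = "Cats chase mice. Cats sleep a lot. Dogs bark."
print(summarize(text, n=2))
```

A weighted sum is only one way to combine scores; a trained classifier, as in the Bayesian approach listed above, would learn the combination instead.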
Methods
Statistical scoring methods
• Problems:
  • Synonymy: one concept can be expressed by different words.
    • example: "cycle" and "bicycle" refer to the same kind of vehicle.
  • Polysemy: one word or concept can have several meanings.
    • example: "cycle" could mean life cycle or bicycle.
  • Phrases: a phrase may have a meaning different from the words in it.
    • An alleged murderer is not a murderer (Lin and Hovy 1997)
Higher semantic/syntactic structures
• Network (graph) based methods
• Other methods (rhetorical analysis, lexical chains, co-reference chains)
AI methods
LSI based summarization (Gong, 2001)
Make the term-sentence matrix
Apply SVD to the term-sentence matrix
Problem: TF-ISF cannot capture context and relations correctly
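A rough sketch of the term-sentence matrix and SVD steps above, using NumPy. It follows the general idea of picking, for each of the top k right singular vectors, the sentence with the largest entry. Raw term counts are used instead of TF-ISF weights, and taking the absolute value to sidestep the sign ambiguity of SVD is an assumption of this sketch; the function names are hypothetical.

```python
import re
import numpy as np

def term_sentence_matrix(sentences):
    # Rows are terms, columns are sentences; entries are raw term counts.
    tokens = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    vocab = sorted({w for t in tokens for w in t})
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(sentences)))
    for j, t in enumerate(tokens):
        for w in t:
            A[index[w], j] += 1.0
    return A

def lsa_select(sentences, k):
    A = term_sentence_matrix(sentences)
    # Rows of Vt are right singular vectors: one weight per sentence per concept.
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    chosen = []
    for row in Vt[:k]:
        # Take the strongest not-yet-chosen sentence for this concept.
        for j in np.argsort(-np.abs(row)):
            if int(j) not in chosen:
                chosen.append(int(j))
                break
    return [sentences[j] for j in sorted(chosen)]

sentences = [
    "The army tested a missile defense system.",
    "The interceptor hit the target in space.",
    "Iran tested a long-range missile.",
]
print(lsa_select(sentences, k=2))
```

Here `k` must not exceed the number of sentences, since the decomposition yields at most that many singular vectors.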
Proposed Approach
1. Preprocessing (tokenizing, stop-word removal, stemming)
2. Extract Context (use LSA on the term-document matrix)
3. Extract Perspective (SRL and WordNet)
4. Summary Generation
Proposed Approach
Preprocessing
• Tokenize and remove stop words
• Stem and build the term-document matrix A
Extract Context
• Apply SVD to A and use matrix U (term-concept)
• Calculate the cosine distance between concepts and documents
• Calculate the cosine distance between sentences and the concept of each topic
• Rank sentences
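The context-extraction steps above might look roughly like the sketch below. The slides do not say how the concept of a topic is chosen, so picking the column of U with the highest mean cosine to the documents is an assumption of this example, as are the use of raw counts, the absolute value (SVD vectors are defined only up to sign), and every function name.

```python
import re
import numpy as np

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def count_vector(text, index):
    # Bag-of-words vector over the document vocabulary; unknown words are ignored.
    v = np.zeros(len(index))
    for w in tokenize(text):
        if w in index:
            v[index[w]] += 1.0
    return v

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def rank_sentences(documents, sentences):
    vocab = sorted({w for d in documents for w in tokenize(d)})
    index = {w: i for i, w in enumerate(vocab)}
    # Term-document matrix A: one column per document.
    A = np.column_stack([count_vector(d, index) for d in documents])
    # Columns of U are term-concept vectors.
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    # Assumption: the topic's concept is the one closest, on average, to its documents.
    mean_sim = [
        np.mean([abs(cosine(U[:, c], A[:, d])) for d in range(A.shape[1])])
        for c in range(U.shape[1])
    ]
    concept = U[:, int(np.argmax(mean_sim))]
    # Rank sentences by cosine similarity to that concept.
    scores = [abs(cosine(count_vector(s, index), concept)) for s in sentences]
    order = sorted(range(len(sentences)), key=lambda i: -scores[i])
    return [(sentences[i], scores[i]) for i in order]

docs = [
    "missile defense system test",
    "army tested missile defense",
    "missile interceptor hit target",
]
candidates = ["Bananas taste sweet", "The army tested a missile defense system"]
for sentence, score in rank_sentences(docs, candidates):
    print(round(score, 3), sentence)
```

Stemming is omitted here for brevity; in the pipeline above it would happen before the matrix is built.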
Proposed Approach
Extract Perspective
• Use SRL and WordNet for sentence similarity
Cosine distance problem:
S1 = United States Army, successfully tested an anti-missile defense system.
S2 = U.S. military projectile interceptor, streaked into space and hit the target.
S3 = Iran's weekend test of a long-range missile underscored the need for a U.S. national missile defense system.
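The problem with the three sentences above can be reproduced with a plain bag-of-words cosine similarity. The tokenizer and function names are assumptions for illustration. S1 and S2 describe the same event but share no words, so their cosine similarity is zero, while S1 and S3 share "missile", "defense" and "system" and therefore score higher.

```python
import math
import re
from collections import Counter

def bag_of_words(sentence):
    # Lowercase and keep alphabetic runs only ("anti-missile" -> "anti", "missile").
    return Counter(re.findall(r"[a-z]+", sentence.lower()))

def cosine_similarity(a, b):
    va, vb = bag_of_words(a), bag_of_words(b)
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

S1 = "United States Army, successfully tested an anti-missile defense system."
S2 = "U.S. military projectile interceptor, streaked into space and hit the target."
S3 = ("Iran's weekend test of a long-range missile underscored the need "
      "for a U.S. national missile defense system.")

print("sim(S1, S2) =", cosine_similarity(S1, S2))
print("sim(S1, S3) =", cosine_similarity(S1, S3))
```

Word overlap alone ranks S3 closer to S1 than S2, which is the mismatch that motivates using SRL and WordNet for sentence similarity.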
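Semantic Similarity
S1 = United States Army, successfully tested an anti-missile defense system.
SRL labels: subject = "United States Army"; AM-MNR = "successfully"; verb = "tested"; object = "an anti-missile defense system"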
Summary Generation
• Remove redundancy and rank sentences
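The slides do not specify how redundancy is removed. One common possibility, sketched here purely as an assumption, is a greedy pass over the ranked sentences that skips any sentence too similar to one already selected. The Jaccard word-overlap measure, the threshold value and the function names are all illustrative choices.

```python
import re

def word_set(sentence):
    return set(re.findall(r"[a-z]+", sentence.lower()))

def jaccard(a, b):
    wa, wb = word_set(a), word_set(b)
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0

def select_non_redundant(ranked_sentences, max_sentences=2, threshold=0.5):
    # ranked_sentences is assumed to be sorted best-first.
    summary = []
    for sentence in ranked_sentences:
        if len(summary) == max_sentences:
            break
        # Keep the sentence only if it is not too close to anything already kept.
        if all(jaccard(sentence, kept) < threshold for kept in summary):
            summary.append(sentence)
    return summary

ranked = [
    "The army tested a missile defense system.",
    "The army tested the missile defense system.",
    "Iran tested a long-range missile.",
]
print(select_non_redundant(ranked))
```

In the proposed approach the similarity measure would presumably be the SRL and WordNet based one rather than word overlap.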
Evaluation Tools & Summarization Systems
ROUGE: Recall-Oriented Understudy for Gisting Evaluation
• Types: ROUGE-N, ROUGE-L, ROUGE-W, ROUGE-S, ROUGE-SU
MEAD
• http://www.summarization.com/mead
• Chinese, English, Japanese, Dutch
DMSumm
• http://www.icmc.usp.br/~taspardo/DMSumm.htm
• Portuguese, English
SweSum (Martin Hassel)
• http://swesum.nada.kth.se/index-eng.html
• English, German, Italian, Spanish, Greek, ...
FarsiSum (Nima Mazdak, Martin Hassel)
• http://swesum.nada.kth.se/index-eng.html
SUMMARIST, PERSIVAL, GLEANS, SumUM, RIPTIDES, NTT, GISTSumm, GISTexter, DiaSumm, NeATS
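For the ROUGE-N measure listed above, recall is the number of reference n-grams that also occur in the candidate summary (clipped by the candidate's counts), divided by the total number of n-grams in the references. The sketch below is a simplified illustration with a naive tokenizer; it is not the official ROUGE toolkit, which adds options such as stemming and stop-word removal, and the function names are assumptions.

```python
import re
from collections import Counter

def ngrams(text, n):
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, references, n=2):
    cand = ngrams(candidate, n)
    matched = total = 0
    for ref in references:
        ref_counts = ngrams(ref, n)
        # Clip each match by how often the n-gram appears in the candidate.
        matched += sum(min(count, cand[gram]) for gram, count in ref_counts.items())
        total += sum(ref_counts.values())
    return matched / total if total else 0.0

candidate = "the cat sat on the mat"
references = ["the cat lay on the mat"]
print(rouge_n_recall(candidate, references, n=2))  # 3 of 5 reference bigrams match
```

With `n=2` this corresponds to the ROUGE-2 recall idea; ROUGE-SU4 additionally counts skip-bigrams and unigrams and is not covered by this sketch.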
References
[1] I. Mani. Automatic Summarization. John Benjamins Publishing Company, 2001.
[2] J. Y. Yeh, H. R. Ke, W. P. Yang, and I. H. Meng. Text summarization using a trainable summarizer and latent semantic analysis. Information Processing and Management, 41, 75-95, 2005.
[3] Y. Gong and X. Liu. Generic text summarization using relevance measure and latent semantic analysis. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'01, New Orleans, 2001.
[4] J. Steinberger, M. A. Kabadjov, M. Poesio, and O. Sanchez-Graillet. Improving LSA-based summarization with anaphora resolution. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, 2005.
[5] H. Yu. News summarization based on semantic similarity measure. Ninth International Conference on Hybrid Intelligent Systems, vol. 1, pp. 180-183, 2009.
[6] C. H. Papadimitriou, P. Raghavan, H. Tamaki, and S. Vempala. Latent semantic indexing: A probabilistic analysis. J. Comput. Syst. Sci., 61(2):217-235, 2000.
[7] C.-Y. Lin and E. Hovy. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of HLT-NAACL, 2003.
[8] T. Nomoto and Y. Matsumoto. A new approach to unsupervised text summarization. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'01, New Orleans, Louisiana, United States, 2001.
[9] J. Lee, S. Park, C. Ahn, and D. Kim. Automatic generic document summarization based on non-negative matrix factorization. Information Processing and Management, 2008.
[10] J. Steinberger, M. Poesio, M. A. Kabadjov, and K. Ježek. Two uses of anaphora resolution in summarization. Information Processing and Management: an International Journal, vol. 43, November 2007.
[11] D. Wang, T. Li, S. Zhu, and C. Ding. Multi-document summarization via sentence-level semantic analysis and symmetric matrix factorization. SIGIR'08, July 2008, Singapore.
[12] V. Gupta and G. S. Lehal. A Survey of Text Summarization Extractive Techniques. Journal of Emerging Technologies in Web Intelligence, August 2010.
thanks
Dataset Specifications
DUC 2007 (Document Understanding Conferences)
• AQUAINT corpus: Associated Press and New York Times (1998-2000) and Xinhua News Agency (1996-2000)
• 45 topics, 25 documents in each topic, 1125 documents in total
• 531174 terms; 262225 terms without stop words; 20057 terms after stemming and without stop words
• Evaluation measures: ROUGE-2, ROUGE-SU4
• 32 system summarizers
• Each topic has 4 human summaries
• Ten NIST assessors wrote summaries for the 45 topics in the DUC 2007 main task.
Experimental Result
Recall on ROUGE-2
[Chart: recall (y-axis, 0.02 to 0.13) per system (x-axis, ticks 1 to 31 plus "GoCAT2")]
Average result on 3 topics
Experimental Result
Recall on ROUGE-SU4
[Chart: recall (y-axis, 0.06 to 0.18) per system (x-axis, ticks 1 to 31 plus "GoCAT2")]
Average result on 3 topics
The Best …
Topic : US missile defense system
Word   Result    Systems                     Evaluation
288    0.14273   1-2-3-4-8-9-16-20-21-29     ROUGE-2
250    0.13599   15                          ROUGE-2
328    0.21631   1-3-4-8-9-14-16-22-28-29    ROUGE-SU4
250    0.18899   15                          ROUGE-SU4