Concept based Multi-Document Text Summarization. By: Asef Poormasoomi. Supervisor: Dr. Kahani. Autumn 2010. Ferdowsi University of Mashhad.


Page 1: Concept based Multi-Document Text Summarization

Concept based Multi-Document Text Summarization

By: Asef Poormasoomi
Supervisor: Dr. Kahani
Autumn 2010

Ferdowsi University of Mashhad

Page 2

Introduction

Summary: a brief but accurate representation of the contents of a document.

Page 3

Is this the best we can do?

Motivation

• Abstracts for scientific and other articles
• News summarization (mostly multi-document summarization)
• Classification of articles and other written data
• Web pages for search engines
• Web access from PDAs and cell phones
• Question answering and data gathering

Page 4

• Extract vs. abstract: lists fragments of the text vs. re-phrases the content coherently.
  Example: "He ate a banana, an orange and an apple" => "He ate fruit".
• Generic vs. query-oriented: presents the author's view vs. reflects the user's interest (e.g., a question answering system).
• Personal vs. general: takes the reader's prior knowledge into account vs. targets a general audience.
• Single-document vs. multi-document: based on one text vs. fuses together many texts.
• Input type: text, video, image, map.

Genres

Page 5

Methods

• Statistical scoring methods (pseudo)
• Higher semantic/syntactic structures:
  • Network (graph) based methods
  • Semantic based methods (LSA, ontology, WordNet)
  • Other methods (rhetorical analysis, lexical chains, co-reference chains)
• AI methods

Page 6

Statistical scoring (Pseudo)

General method:
1. Score each entity (sentence, word)
2. Combine the scores
3. Choose the best sentence(s)

Scoring techniques:
• Word frequencies throughout the text (Luhn 58)
• Position in the text (Edmundson 69, Lin & Hovy 97)
• Title method (Edmundson 69)
• Cue phrases in sentences (Edmundson 69)
• Bayesian classifier (Kupiec et al. 95)
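The general method above can be sketched as a Luhn-style word-frequency scorer. This is a minimal illustration only; the stop-word list and the sample sentences below are made up for the example and are not from the slides.

```python
import re
from collections import Counter

# Hypothetical stop-word list for the example; real systems use larger lists.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "it", "was"}

def luhn_scores(sentences):
    """Score each sentence by the summed corpus frequency of its
    content words (a sketch of Luhn's 1958 frequency heuristic)."""
    words = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    freq = Counter(w for ws in words for w in ws if w not in STOPWORDS)
    return [sum(freq[w] for w in ws if w not in STOPWORDS) for ws in words]

sentences = [
    "The missile defense system was tested.",
    "The test of the defense system succeeded.",
    "It rained in the afternoon.",
]
scores = luhn_scores(sentences)  # higher score = more central words
```

The off-topic third sentence receives the lowest score, so a frequency-based extractor would drop it first.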

Page 7

Methods

Statistical scoring methods: problems
• Synonymy: one concept can be expressed by different words.
  Example: "cycle" and "bicycle" refer to the same kind of vehicle.
• Polysemy: one word can have several meanings.
  Example: "cycle" could mean "life cycle" or "bicycle".
• Phrases: a phrase may have a meaning different from that of the words in it.
  Example: an alleged murderer is not a murderer (Lin and Hovy 1997).

Higher semantic/syntactic structures:
• Network (graph) based methods
• Other methods (rhetorical analysis, lexical chains, co-reference chains)

AI methods

Page 8

LSI-based summarization (Gong, 2001)

1. Build the term-sentence matrix
2. Apply SVD to the term-sentence matrix

Problem: TF-ISF cannot capture context and term relations correctly.
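The SVD step above can be sketched with NumPy. The matrix values, the term labels, and the choice of k below are illustrative stand-ins, not data from the slides; the selection rule (one top-weighted sentence per leading concept) follows Gong & Liu (2001).

```python
import numpy as np

# Toy term-sentence matrix A (rows = terms, columns = sentences);
# the weights are illustrative TF-ISF-style values.
A = np.array([
    [2.0, 0.0, 1.0],   # term "missile"
    [1.0, 1.0, 0.0],   # term "defense"
    [0.0, 2.0, 0.0],   # term "treaty"
    [0.0, 0.0, 2.0],   # term "test"
])

# SVD: A = U @ diag(s) @ Vt. Row i of Vt gives each sentence's
# weight on latent concept i.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# For each of the top-k concepts, pick the highest-weighted sentence.
k = 2
summary_idx = [int(np.argmax(np.abs(Vt[i]))) for i in range(k)]
```

Each selected index points at the sentence that best represents one latent concept, which is how the LSI approach avoids picking two sentences about the same thing.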

Page 9

Proposed Approach

1. Preprocessing (tokenizing, stop-word removal, stemming)
2. Context extraction (LSA on the term-document matrix)
3. Perspective extraction (SRL and WordNet)
4. Summary generation

Page 10

Proposed Approach

Preprocessing:
• Tokenize and remove stop words
• Stem and build the term-document matrix A

Extract context:
• Apply SVD to A and use the matrix U (term-concept)
• Compute the cosine distance between concepts and documents
• Compute the cosine distance between sentences and the concept of each topic
• Rank the sentences
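The cosine ranking step can be sketched in plain Python. The concept and sentence vectors below are hypothetical stand-ins for a column of U and sentence vectors expressed in the same term space; they are not from the slides.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical concept vector (one column of U) and sentence vectors.
concept = [0.8, 0.5, 0.1]
sentences = {
    "s1": [0.9, 0.4, 0.0],   # closely aligned with the concept
    "s2": [0.1, 0.2, 0.9],   # mostly about something else
}

# Rank sentences by similarity to the concept, best first.
ranked = sorted(sentences, key=lambda k: cosine(sentences[k], concept),
                reverse=True)
```

The highest-ranked sentences for each concept become candidates for the summary.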

Page 11

Proposed Approach

Extract perspective:
• Use SRL and WordNet for sentence similarity
• Problem with plain cosine distance:

S1 = United States Army successfully tested an anti-missile defense system.
S2 = U.S. military projectile interceptor streaked into space and hit the target.
S3 = Iran's weekend test of a long-range missile underscored the need for a U.S. national missile defense system.

Semantic similarity (SRL roles):
S1 = [United States Army: subject] [successfully: AM-MNR] [tested: verb] [an anti-missile defense system: object]

Summary generation: remove redundancy and rank the sentences.
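A minimal sketch of greedy redundancy removal over an already-ranked sentence list. The similarity measure (word-overlap Jaccard), the threshold, and the sample sentences are illustrative assumptions; the slides do not specify them.

```python
def jaccard(a, b):
    """Word-overlap similarity used as a cheap redundancy check."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def greedy_select(ranked_sentences, max_sents=2, threshold=0.5):
    """Walk the ranked list, keeping a sentence only if it is not
    too similar to any sentence already chosen."""
    chosen = []
    for s in ranked_sentences:
        if all(jaccard(s, c) < threshold for c in chosen):
            chosen.append(s)
        if len(chosen) == max_sents:
            break
    return chosen

ranked = [
    "the army tested a missile defense system",
    "a missile defense system was tested by the army",   # redundant
    "iran tested a long-range missile last weekend",
]
summary = greedy_select(ranked)
```

The second sentence overlaps heavily with the first, so it is skipped and the summary keeps the first and third.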

Page 12

Evaluation Tools & Summarization Systems

ROUGE: Recall-Oriented Understudy for Gisting Evaluation
Types: ROUGE-N, ROUGE-L, ROUGE-W, ROUGE-S, ROUGE-SU
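As an illustration, ROUGE-N recall reduces to n-gram overlap with a reference summary. This is a simplified sketch: the real ROUGE toolkit uses clipped counts, optional stemming, and multiple references.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(candidate, reference, n=2):
    """Fraction of the reference's n-grams that also appear in the
    candidate summary (simplified ROUGE-N recall)."""
    cand = set(ngrams(candidate.lower().split(), n))
    ref = ngrams(reference.lower().split(), n)
    if not ref:
        return 0.0
    return sum(1 for g in ref if g in cand) / len(ref)

# Made-up candidate/reference pair for illustration.
score = rouge_n_recall("the army tested a missile system",
                       "the army tested a new missile system")
```

With n=2 this is ROUGE-2; the experiments later in the deck report ROUGE-2 and ROUGE-SU4.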

MEAD (http://www.summarization.com/mead): Chinese, English, Japanese, Dutch

DMSumm (http://www.icmc.usp.br/~taspardo/DMSumm.htm): Portuguese, English

SweSum (Martin Hassel, http://swesum.nada.kth.se/index-eng.html): English, German, Italian, Spanish, Greek, ...

FarsiSum (Nima Mazdak, Martin Hassel): http://swesum.nada.kth.se/index-eng.html

Other systems: SUMMARIST, PERSIVAL, GLEANS, SumUM, RIPTIDES, NTT, GISTSumm, GISTexter, DiaSumm, NeATS

Page 13

References

[1] I. Mani. Automatic Summarization. John Benjamins Publishing Company, 2001.
[2] J. Y. Yeh, H. R. Ke, W. P. Yang, and I. H. Meng. Text summarization using a trainable summarizer and latent semantic analysis. Information Processing and Management, 41:75-95, 2005.
[3] Y. Gong and X. Liu. Generic text summarization using relevance measure and latent semantic analysis. In Proceedings of SIGIR'01, New Orleans, 2001.
[4] J. Steinberger, M. A. Kabadjov, M. Poesio, and O. Sanchez-Graillet. Improving LSA-based summarization with anaphora resolution. In Proceedings of HLT/EMNLP, 2005.
[5] H. Yu. News summarization based on semantic similarity measure. In Ninth International Conference on Hybrid Intelligent Systems, vol. 1, pp. 180-183, 2009.
[6] C. H. Papadimitriou, P. Raghavan, H. Tamaki, and S. Vempala. Latent semantic indexing: a probabilistic analysis. J. Comput. Syst. Sci., 61(2):217-235, 2000.
[7] C.-Y. Lin and E. Hovy. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of HLT-NAACL, 2003.
[8] T. Nomoto and Y. Matsumoto. A new approach to unsupervised text summarization. In Proceedings of SIGIR'01, New Orleans, 2001.
[9] J. Lee, S. Park, C. Ahn, and D. Kim. Automatic generic document summarization based on non-negative matrix factorization. Information Processing and Management, 2008.
[10] J. Steinberger, M. Poesio, M. A. Kabadjov, and K. Jezek. Two uses of anaphora resolution in summarization. Information Processing and Management, 43, November 2007.
[11] D. Wang, T. Li, S. Zhu, and C. Ding. Multi-document summarization via sentence-level semantic analysis and symmetric matrix factorization. In Proceedings of SIGIR'08, Singapore, July 2008.
[12] V. Gupta and G. S. Lehal. A survey of text summarization extractive techniques. Journal of Emerging Technologies in Web Intelligence, August 2010.

Page 14

Thanks!

Page 15

Dataset Specifications

DUC 2007 (Document Understanding Conferences)

AQUAINT corpus: Associated Press and New York Times (1998-2000) and Xinhua News Agency (1996-2000)

• 1125 documents in total: 45 topics, 25 documents per topic
• 531,174 terms in total; 262,225 terms without stop words; 20,057 terms after stemming and stop-word removal
• Evaluation measures: ROUGE-2 and ROUGE-SU4
• 32 system summarizers; each topic has 4 human summaries
• Ten NIST assessors wrote summaries for the 45 topics in the DUC 2007 main task.

Page 16

Experimental Result

Recall On ROUGE-2

135791113151719212325272931GoCAT2

0.02

0.0300000000000001

0.0400000000000001

0.0500000000000001

0.0600000000000001

0.0700000000000002

0.0800000000000002

0.0900000000000002

0.1

0.11

0.12

0.13

Average result on 3 topics

Page 17

Experimental Result

Recall on ROUGE-SU4

[Chart: ROUGE-SU4 recall (y-axis approx. 0.06-0.18) for DUC 2007 systems 1-31 and the proposed system (CAT2); average result on 3 topics]

Page 18

The Best …

Topic: US missile defense system

Evaluation   Systems                    Result    Words
ROUGE-2      1-2-3-4-8-9-16-20-21-29    0.14273   288
ROUGE-2      15                         0.13599   250
ROUGE-SU4    1-3-4-8-9-14-16-22-28-29   0.21631   328
ROUGE-SU4    15                         0.18899   250