UNL Document Summarization. Virach Sornlertlamvanich, Tanapong Potipiti and Thatsanee Charoenporn. Information Research and Development Division, National Electronics and Computer Technology Center (NECTEC), THAILAND. {virach,tanapong,thatsanee}@nectec.or.th. The First International Workshop on MultiMedia Annotation, 30-31 January 2001, Tokyo, Japan.


Page 1: UNL Document Summarization

UNL Document Summarization

Virach Sornlertlamvanich, Tanapong Potipiti and Thatsanee Charoenporn

Information Research and Development Division

National Electronics and Computer Technology Center (NECTEC), THAILAND

{virach,tanapong,thatsanee}@nectec.or.th

The First International Workshop on MultiMedia Annotation, 30-31 January 2001, Tokyo, Japan

Page 2: UNL Document Summarization

Overview

• UNL project
• UNL specification
• UNL document summarization
  - Sentence score
  - N-best sentences
  - Removal of redundant words
  - Merging sentences
• Conclusion

Page 3: UNL Document Summarization

UNL project

• Initiated by the United Nations University in 1996
• Collaboration of research institutions from 16 countries
• International semantic annotation standard for multilingual communication
• Interlingua-based data archiving

Page 4: UNL Document Summarization

UNL and existing MT

• Existing interlingual MT
  Source language → (analysis) → Interlingua → (generation) → Target language
  Errors in analysis are propagated into the generation process.

• UNL
  User → (preparing) → UNL document → (generation) → Target language
  No analysis errors are propagated into the generation process.

Page 5: UNL Document Summarization

UNL specification

• Interlingual representation
  - Nodes: UWs (interlingual acceptations)
  - Links: UNL semantic relations such as agt, obj, pur ...

[Figure: the UNL graph representing 'The bachelor books a room for 2 persons.' Nodes (UWs): book(icl>do,obj>room).@entry, bachelor(icl>man).@def, room(icl>space), 2, person(icl>body).@pl. Relation labels on the links: agt, obj, pur, qua. Callouts mark the headword, restriction and attribute of a UW, and the relation on a link.]

Page 6: UNL Document Summarization

UW specification

• UW format: <headword>(<list of restrictions>), e.g. book(icl>do, obj>room)
• Headword: an English word that roughly describes the UW sense.
• Restrictions:
  - Inclusion (icl), field (fld), e.g. car(icl>movable thing)
  - Relations

UW Class Hierarchy
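
The UW format above can be illustrated with a small parser. This is a minimal sketch, not part of the UNL tools: the `UW` class, the `parse_uw` function and the regular expression are assumptions made for illustration, and the sketch only handles the flat forms shown on these slides (a headword, an optional comma-separated restriction list, and optional `.@attribute` suffixes).

```python
import re
from dataclasses import dataclass, field


@dataclass
class UW:
    """Hypothetical container for the parts of a universal word."""
    headword: str
    restrictions: list = field(default_factory=list)  # (label, value) pairs
    attributes: list = field(default_factory=list)    # e.g. ["entry", "pl"]


# headword, optional "(...)" restriction list, optional ".@attr" suffixes
_UW_RE = re.compile(
    r"^(?P<head>[^(.]+)(?:\((?P<restr>[^)]*)\))?(?P<attrs>(?:\.@\w+)*)$"
)


def parse_uw(text):
    """Split a UW string into headword, restrictions and attributes."""
    match = _UW_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a UW in the simple form assumed here: {text!r}")
    restrictions = []
    if match.group("restr"):
        for part in match.group("restr").split(","):
            label, _, value = part.strip().partition(">")
            restrictions.append((label, value))
    attributes = [a for a in match.group("attrs").split(".@") if a]
    return UW(match.group("head").strip(), restrictions, attributes)


booking = parse_uw("book(icl>do,obj>room).@entry")
car = parse_uw("car(icl>movable thing)")
```

Nested restrictions and instance identifiers such as `language:02` are outside the scope of this sketch.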

Page 7: UNL Document Summarization

UNL document summarization

• Multilinguality: UNL is an interlingua for multilingual applications.

• Unambiguity: UNL is designed to contain no semantic ambiguity.

• Semantic information: with UNL, semantic information is employed for high-quality summarization.

Page 8: UNL Document Summarization

Four steps in UNL summarization

Input: UNL document

1: Calculating sentence scores

2: Selecting n-best sentences

3: Removing redundant words

4: Merging sentences

Output: Summary

Page 9: UNL Document Summarization

1: Calculating sentence scores

• A sentence score is calculated as follows:

$$S(s) = \sum_{uw_i \in s} W(uw_i)$$

$$W(uw_i) = Tf(uw_i) \cdot Idf(uw_i)$$

where
S: sentence scoring function
s: the sentence under consideration
W: weighting function
uw_i: universal word
Tf: term frequency
Idf: inverse document frequency
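
The scoring step can be sketched in a few lines of Python. The slides do not say how Tf and Idf are counted, so this sketch makes three assumptions: Tf is the number of occurrences of a UW in the whole document, Idf is log(N / df) with N the number of sentences and df the number of sentences containing the UW, and each distinct UW in a sentence is counted once. The function name `sentence_scores` is hypothetical.

```python
import math
from collections import Counter


def sentence_scores(sentences):
    """Score each sentence as the sum of Tf * Idf over its distinct UWs.

    `sentences` is a list of sentences, each a list of UW strings.
    Tf and Idf follow the assumptions stated in the lead-in, not a
    definition taken from the slides.
    """
    n_sentences = len(sentences)
    tf = Counter(uw for sent in sentences for uw in sent)
    df = Counter(uw for sent in sentences for uw in set(sent))

    def weight(uw):
        return tf[uw] * math.log(n_sentences / df[uw])

    return [sum(weight(uw) for uw in set(sent)) for sent in sentences]


# Toy document: "a" occurs in every sentence, so its Idf (and weight) is 0.
scores = sentence_scores([["a", "b"], ["a", "c"], ["a"]])
```

A UW that appears in every sentence contributes nothing under this Idf, which is the usual effect of the tf-idf weighting.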

Page 10: UNL Document Summarization

2: Selecting n-best sentences

• The five highest-scoring sentences are selected from the original text of 100 sentences (2,000 words).

UNL represents the means to facilitate multilingual communication on the information network.

The language exists only on the information network.

UNL is a global-scale common language, being transparent to all languages.

Information encoded in UNL is converted to an equivalent counterpart written in the target language, through a language generator "deconvertor" prepared for each language.

Complying with the same technical standard, these computer networks comprise the Internet.
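
Selecting the n-best sentences from the scores can be sketched as follows. The function name `n_best` is hypothetical, and returning the selected sentences in their original document order is an assumption; the slides only state that the highest-scoring sentences are selected.

```python
def n_best(sentences, scores, n=5):
    """Return the n highest-scoring sentences, kept in document order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:n])  # restore original order (an assumption)
    return [sentences[i] for i in chosen]


picked = n_best(["s1", "s2", "s3", "s4"], [0.1, 0.9, 0.5, 0.7], n=2)
```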

Page 11: UNL Document Summarization

3: Removing redundant words-1

• Insignificant modifier words are removed. Modifier relations include man, mod, ben and the like.

[Equation: the contribution score Con(l(uw1, uw2)) of a link]

where
Con: contribution function
l(): the UNL relation under consideration
W: UW weighting function
uw1: head UW
uw2: dependent UW

The links are removed if the contribution score is less than a threshold (1.5 in the experiment).

Page 12: UNL Document Summarization

3: Removing redundant words-2

Threshold of the contribution score is 1.5.

"UNL represents the means to facilitate multilingual communication on the information network."

Con(met(facilitate.@pred, means.@def)) = 4.27

Con(mod(communication, network.@def)) = 1.81

Con(mod(communication, multilingual.@indef)) = 1.78

Con(mod(network.@def, information)) = 0.47 (removed)
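
The thresholding step can be sketched with the four contribution scores from this slide. The sketch takes the scores as given and does not compute Con; the function name `prune_links` and the triple layout (relation, head UW, dependent UW) are assumptions for illustration, with attributes such as `.@def` omitted for brevity.

```python
THRESHOLD = 1.5  # value reported for the experiment


def prune_links(scored_links, threshold=THRESHOLD):
    """Split links into kept and removed by their contribution score."""
    kept, removed = [], []
    for link, score in scored_links:
        (kept if score >= threshold else removed).append(link)
    return kept, removed


# (relation, head UW, dependent UW) with the scores shown on the slide
scored_links = [
    (("met", "facilitate", "means"), 4.27),
    (("mod", "communication", "network"), 1.81),
    (("mod", "communication", "multilingual"), 1.78),
    (("mod", "network", "information"), 0.47),
]

kept, removed = prune_links(scored_links)
```

With these scores only the link to "information" falls below the threshold, matching the slide.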

Page 13: UNL Document Summarization

3: Removing redundant words-3

Sentences and the words removed from each:

1. UNL represents the means to facilitate multilingual communication on the information network.
   Removed words: information

2. The language exists only on the information network.
   Removed words: only, information

3. UNL is a global-scale common language, being transparent to all languages.
   Removed words: common, all

4. Information encoded in UNL is converted to an equivalent counterpart written in the target language, through a language generator "deconvertor" prepared for each language.
   Removed words: through a language generator "deconvertor" prepared for each language

5. Complying with the same technical standard, these computer networks comprise the Internet.
   Removed words: same, computer

Page 14: UNL Document Summarization

4: Merging sentences-1

• UNL sentences that share the same UW can be merged to produce a more complex sentence.

[Figure: three UNL graphs. Graph 1: exist.@entry.@pred.@present, language.@def and network.@def, with relation labels obj and lpl. Graph 2: language.@entry.@pred.@present.@indef, UNL, global-scale and language:02, with relation labels aoj, aoj and mod. Graph 1+2 (merged): language.@entry.@pred.@present, UNL, global-scale, language:02, exist.@pred.@present and network.@def, with relation labels aoj, aoj, mod, obj and lpl.]

Page 15: UNL Document Summarization

4: Merging sentences-2

The first sentence generated in English

The language exists on the network.

The second sentence generated in English

The UNL language is a global-scale language.

The merged sentence generated in English

The UNL language is a global-scale language existing on the network.
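
The merging idea can be sketched on graphs stored as (relation, head UW, dependent UW) triples. This is a simplified illustration under stated assumptions: the function names are hypothetical, the edges in `second` are illustrative rather than read off the slide's figure, and attributes such as `.@entry` are ignored, although a real merge would have to reconcile them.

```python
def shared_uws(graph_a, graph_b):
    """Return the UWs that appear as nodes in both graphs."""
    def nodes(graph):
        return {uw for _, head, dep in graph for uw in (head, dep)}
    return nodes(graph_a) & nodes(graph_b)


def merge(graph_a, graph_b):
    """Union the edges of two graphs if they share a UW, else None."""
    if not shared_uws(graph_a, graph_b):
        return None
    return sorted(set(graph_a) | set(graph_b))


# "The language exists on the network."
first = [("obj", "exist", "language"), ("lpl", "exist", "network")]
# Illustrative edge for "... is a global-scale language."
second = [("aoj", "global-scale", "language")]

merged = merge(first, second)
unrelated = merge(first, [("agt", "read", "person")])
```

Here the two graphs share the UW `language`, so their edges are combined into one graph; graphs with no UW in common are left unmerged.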

Page 16: UNL Document Summarization

Text vs. UNL summarization

Plain text summarization

UNL represents the means to facilitate multilingual communication on the information network. The language exists only on the information network. UNL is a global-scale common language, being transparent to all languages. Information encoded in UNL is converted to an equivalent counterpart written in the target language, through a language generator "deconvertor" prepared for each language. Complying with the same technical standard, these computer networks comprise the Internet.

5 sentences, 67 words.

UNL document summarization

UNL represents the means to facilitate multilingual communication on the network. UNL is a global-scale language, being transparent to languages, existing on the network. Information encoded in UNL is converted to counterpart written in the target language. These networks comprise the Internet, complying with the technical standard.

4 sentences, 47 words.
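
The word counts on this slide can be checked with a short script. It counts whitespace-separated tokens, which is an assumption about how the words were counted; the variable and function names are hypothetical.

```python
plain_summary = (
    "UNL represents the means to facilitate multilingual communication "
    "on the information network. "
    "The language exists only on the information network. "
    "UNL is a global-scale common language, being transparent to all "
    "languages. "
    "Information encoded in UNL is converted to an equivalent counterpart "
    "written in the target language, through a language generator "
    '"deconvertor" prepared for each language. '
    "Complying with the same technical standard, these computer networks "
    "comprise the Internet."
)

unl_summary = (
    "UNL represents the means to facilitate multilingual communication "
    "on the network. "
    "UNL is a global-scale language, being transparent to languages, "
    "existing on the network. "
    "Information encoded in UNL is converted to counterpart written in "
    "the target language. "
    "These networks comprise the Internet, complying with the technical "
    "standard."
)


def word_count(text):
    """Count whitespace-separated tokens."""
    return len(text.split())


plain_words = word_count(plain_summary)
unl_words = word_count(unl_summary)
ratio = unl_words / plain_words  # length of UNL summary relative to plain
```

Under this counting the UNL summary is roughly 70% of the length of the plain text summary.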

Page 17: UNL Document Summarization

Conclusion

• The process of summarization by UNL has been presented
• UNL provides many advantages in summarization
• Our experiment shows that UNL can improve the quality of summarization
• Applicable to any semantic representation
• Further research
  - Considering the UW class hierarchy as well as the attributes and relations