The Computational Linguistics Summarization Pilot task @ TAC 2014

Kokil Jaidka †, Muthu Kumar Chandrasekaran * ‡, Min-Yen Kan * ‡, Ankur Khanna ‡

Nanyang Technological University †

Dept. of Computer Science, National University of Singapore *

Web, IR / NLP Group ‡, National University of Singapore

Scientific Document Summarization

I have an abstract. I am done!

Photo Credits Dennis Jarvis @flickr

TAC Biomedsumm Track - The Computational Linguistics Pilot Task, 21 April 2023

Outline

• Citation based extractive summaries
• Facetted summaries
• Automatic literature review
• CL development corpus
• Annotation
• TAC 2015: CL-Summ track
• Acknowledgements


Scientific Document Summarization

• Abstracts
  – Authors’ own summary.
• Citation summary
  – The scientific community creates summaries of research papers when citing them, but…
• Facetted summaries
  – Capture all aspects of a paper.


Citation summary & facets

Image credits Ken Ammi @flickr

Structured abstract: common in Medicine, Biomed, Bioinformatics domains

Facetted summaries


Facets & Argumentative zones


Scientific Document Summarization

Citation based extractive summaries

Scope of Citation
• Qazvinian, V., & Radev, D. R. "Identifying non-explicit citing sentences for citation-based summarization" (ACL, 2010)
• Abu-Jbara, Amjad, and Dragomir Radev. "Reference scope identification in citing sentences." (ACL, 2012)

Coherence
• Abu-Jbara, Amjad, and Dragomir Radev. "Coherent citation-based summarization of scientific papers." (ACL, 2011)


Scientific Document Summarization & Automatic Literature Review


Free to access at: http://acl-arc.comp.nus.edu.sg/


SciSumm Corpus

• 10 reference papers or topics randomly sampled from the ACL ARC corpus.
• Up to 10 citing papers per reference paper, including those outside ACL ARC.
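The corpus layout described above can be pictured as a small data structure. This is a minimal sketch only: the class name, field names, function name, and paper IDs are hypothetical and do not reflect the released corpus format.

```python
import random
from dataclasses import dataclass, field

MAX_CITING_PAPERS = 10  # "up to 10 citing papers per reference paper"


@dataclass
class Topic:
    """One topic: a reference paper (RP) plus the papers that cite it."""
    reference_paper_id: str
    citing_paper_ids: list = field(default_factory=list)


def build_corpus(candidates, n_topics=10, seed=0):
    """Randomly sample n_topics reference papers and cap their citing papers.

    `candidates` maps a reference paper ID to the IDs of papers citing it.
    """
    rng = random.Random(seed)
    sampled = rng.sample(sorted(candidates), n_topics)
    return [Topic(rp, list(candidates[rp])[:MAX_CITING_PAPERS]) for rp in sampled]


# Hypothetical toy input: 25 papers, each cited by 12 made-up papers.
toy = {f"RP{i:02d}": [f"RP{i:02d}-C{j}" for j in range(12)] for i in range(25)}
corpus = build_corpus(toy)
print(len(corpus), max(len(t.citing_paper_ids) for t in corpus))
```

Fixing the seed keeps the sampled topics reproducible across runs, which matters when several annotators need to work on the same set.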


Annotation pipeline


[Diagram: document icons labelled "AUTOMATIC SUM" and "SCI DOC SUMM", each paired with an XML version of the form <xml> <abstract> … </abstract> …]

Annotation!

Post-processing to Biomedsumm format:

1. Scripts from U. Colorado (Prabha)
2. Sentence-segmented version from U. Mich (Rahul)

OCR & section parse: ParsCit's SectLabel module

Annotating the SciSumm corpus

• 3 annotators in all.
• Released data has one gold standard annotation per topic or reference paper.
• Discourse facet has a minor change from Biomedsumm’s categories.


Tasks

• Task 1A: For each citance, identify the spans of text (cited text spans) in the RP that most accurately reflect the citance.


Reference Paper (RP)

Citing papers. The citing text is called a citance.
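One simple way to approach Task 1A is to rank the sentences of the reference paper by lexical similarity to the citance. The sketch below uses Jaccard similarity over word sets as an illustrative baseline; it is an assumption for illustration, not a method described in the slides, and the function names and example sentences are made up.

```python
import re


def tokens(text):
    """Lowercased word tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cited_text_spans(citance, rp_sentences, top_k=1):
    """Return indices of the top_k RP sentences most similar to the citance."""
    cit = tokens(citance)
    scored = [(jaccard(cit, tokens(s)), i) for i, s in enumerate(rp_sentences)]
    # Sort by descending score; ties are broken by earlier sentence index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:top_k]]


# Made-up reference paper sentences and citance, for illustration only.
rp = [
    "We present a graph-based method for ranking sentences.",
    "Experiments are run on newswire data.",
    "The method outperforms a frequency baseline.",
]
citance = "Smith et al. proposed a graph-based method for ranking sentences."
print(cited_text_spans(citance, rp))  # prints [0]
```

A purely lexical baseline like this misses paraphrased citances, which is one reason annotated cited text spans are needed for training and evaluation.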

Tasks

• Task 1B: For each cited text span, identify what facet of the paper it belongs to, from a predefined set of facets.


Reference Paper (RP)

Mark the cited text in the RP and provide its facet.

Citing papers. The citing text is called a citance.
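Task 1B can be framed as a classification problem over cited text spans. The following is a minimal cue-word sketch under stated assumptions: the facet labels, the cue words, and the fallback facet are all hypothetical placeholders, since the slides do not list the predefined facet set.

```python
import re

# Hypothetical facet labels and cue words; the actual predefined facet set
# is not listed in the slides.
FACET_CUES = {
    "method": {"method", "algorithm", "approach", "model"},
    "results": {"results", "outperforms", "accuracy", "improves"},
    "aim": {"aim", "goal", "propose", "present"},
}
DEFAULT_FACET = "method"  # assumed fallback when no cue word matches


def classify_facet(span_text):
    """Assign the facet whose cue words occur most often in the span."""
    words = re.findall(r"[a-z]+", span_text.lower())
    counts = {
        facet: sum(w in cues for w in words)
        for facet, cues in FACET_CUES.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else DEFAULT_FACET


print(classify_facet("The model outperforms the baseline and improves accuracy."))
# prints results
```

A trained classifier would normally replace the hand-written cue lists once enough annotated spans are available.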

Evaluation

• Small corpus: 10-fold cross-validated evaluation over the 10 documents.
• Task 1a scored by overlap with citances.
• Task 1b scored by overlap with reference text spans.
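With 10 documents, 10-fold cross-validation amounts to holding out one document per fold. The slides do not state how overlap is computed, so the sketch below assumes set overlap of sentence IDs reported as precision, recall, and F1; the function names and topic names are hypothetical.

```python
def overlap_scores(predicted, gold):
    """Precision, recall and F1 of the overlap between two sets of sentence IDs."""
    predicted, gold = set(predicted), set(gold)
    common = len(predicted & gold)
    precision = common / len(predicted) if predicted else 0.0
    recall = common / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if common else 0.0
    return precision, recall, f1


def leave_one_out_splits(doc_ids):
    """Yield (train, test) splits; with 10 documents this gives 10 folds."""
    for held_out in doc_ids:
        train = [d for d in doc_ids if d != held_out]
        yield train, [held_out]


docs = [f"topic{i}" for i in range(1, 11)]  # hypothetical topic names
splits = list(leave_one_out_splits(docs))
print(len(splits), len(splits[0][0]), splits[0][1])  # prints 10 9 ['topic1']

p, r, f = overlap_scores(predicted=[3, 4, 5], gold=[4, 5, 6, 7])
print(round(p, 3), round(r, 3), round(f, 3))  # prints 0.667 0.5 0.571
```

Averaging the per-fold scores gives a single figure per system, though with only 10 documents the variance across folds is likely to be high.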


Task & evaluation: highlights

• First corpus in the CL domain that incorporates prior research findings on citation-based summaries.

• 10 teams from 5 different countries participated in the evaluation.


Limitations

• No gold standard summaries yet

• OCR errors: We hope to have corrected them manually.

• But mainly, we need more annotated data!


TAC 2015: CL-Summ shared task

• Plans to roll out a full-fledged official shared task for the CL corpus.

• 20 training topics

• 10 test topics

• 3 annotations per summary.


TAC 2015: We need your help!

• We seek support from
  – the summarization community in general and
  – the CL community in particular
  to provide manpower for annotating the corpus

• Great to have all participating teams contribute!


Acknowledgements

• Hoa Dang, NIST

• Lucy Vanderwende, MSR

• All Biomedsumm track participants.

• This research is partially supported by CSIDM


Questions? Thank you!
