ParaPhraser: Russian Paraphrase Corpus and Shared Task
Elena Yagunova, Ekaterina Pronoza, et al.
Saint Petersburg State University & Co
ParaPhraser.ru, 2016
Our approach (overview)
As part of our project ParaPhraser on the identification and classification of Russian paraphrases, we have collected a corpus of more than 8000 sentence pairs annotated as precise, loose or non-paraphrases.
It is annotated via crowdsourcing by naïve native Russian speakers.
Our paraphrase corpus is collected from news headlines and therefore can be considered a summarized news stream describing the most important events.
At first
The aims of our paraphrase project:
◦ to create a publicly available Russian sentential paraphrase corpus
The corpus is NOT intended to be a general-purpose one
Potential applications: Information Extraction, Text Summarization
Sources of paraphrases
Natural:
◦ parallel multilingual corpora
◦ comparable monolingual corpora
◦ different translations of the same stories/novels
◦ news collections
◦ texts from social networks
Artificial
Other corpora
Microsoft Research Paraphrase Corpus (2004):
◦ 5801 sentence pairs, 67% paraphrases
◦ 2 classes: paraphrases and non-paraphrases
◦ annotated by 2 experts
Also: METER corpus, Knight and Marcu corpus, User Language Paraphrase Corpus, PPDB, SEMILAR, Twitter Paraphrase Corpus, etc.
Our unsupervised approach
◦ Parse news articles in real time
◦ Calculate the similarity metric for each pair of headlines published by different media agencies within a short period of time
◦ Include candidate pairs in the corpus
◦ Annotate candidate pairs (using crowdsourcing)
Unsupervised Similarity Metric
Extends matrix metric by Fernando and Stevenson:
The word similarity matrix W is calculated according to the rules:
◦ identical words starting with a capital letter -> 1.2
◦ identical words -> 1
◦ synonyms -> NPMI, Dice or Jaccard synset coefficient multiplied by 0.8
◦ substrings -> the length of the smaller word divided by the length of the larger word, multiplied by 0.7
◦ common prefix (at least 3 characters) -> the prefix length divided by the length of the smaller word, multiplied by 0
◦ otherwise -> 0
sim(a, b) = (a · W · bᵀ) / (|a| · |b|)
where a and b are the word vectors of the two sentences and W is the word similarity matrix.
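As a rough illustration, the scoring rules and the matrix metric can be sketched in Python. This is only a sketch of the scheme described above, not the project's implementation: the synonym coefficient is assumed to be precomputed externally, and `prefix_weight` is a placeholder value, since the multiplier for the common-prefix rule is truncated on the slide.

```python
import numpy as np

def word_pair_score(w1, w2, synonym_score=None, prefix_weight=0.6):
    """Score one word pair per the rules above (a sketch).

    synonym_score: a precomputed NPMI/Dice/Jaccard synset coefficient,
    or None if the words are not known synonyms.
    prefix_weight: placeholder -- the slide truncates this multiplier.
    """
    if w1 == w2:
        # identical words; capitalized ones (often named entities) score higher
        return 1.2 if w1[0].isupper() else 1.0
    if synonym_score is not None:
        return 0.8 * synonym_score
    a, b = w1.lower(), w2.lower()
    small, large = sorted((a, b), key=len)
    if small in large:
        # one word is a substring of the other
        return 0.7 * len(small) / len(large)
    prefix = 0
    for c1, c2 in zip(a, b):
        if c1 != c2:
            break
        prefix += 1
    if prefix >= 3:
        return prefix_weight * prefix / len(small)
    return 0.0

def sentence_similarity(words_a, words_b):
    """Matrix metric sim(a, b) = a·W·bT / (|a|·|b|) over two token lists."""
    W = np.array([[word_pair_score(w1, w2) for w2 in words_b]
                  for w1 in words_a])
    # simplified: binary occurrence vectors over each sentence's own tokens
    a = np.ones(len(words_a))
    b = np.ones(len(words_b))
    return float(a @ W @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
```

Note that identical capitalized words can push the score of a sentence pair above 1, which is intended: matching named entities are strong paraphrase evidence in news headlines.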
Paraphrase classes
Precise paraphrase:
◦ КНДР аннулировала договор о ненападении с Южной Кореей.
  (DPRK annulled the non-aggression treaty with South Korea.)
◦ КНДР вышла из соглашений о ненападении с Южной Кореей.
  (DPRK withdrew from the non-aggression agreement with South Korea.)
Loose paraphrase:
◦ ВТБ может продать долю в Tele2 в ближайшие недели.
  (VTB might sell its stake in Tele2 in the coming weeks.)
◦ ВТБ анонсировал продажу Tele2.
  (VTB announced the sale of Tele2.)
Paraphrase classes
Non-paraphrase:
◦ В главном здании МГУ загорелась столовая.
  (The canteen caught fire in the main building of MSU.)
◦ Из главного здания МГУ эвакуированы около 300 человек.
  (About 300 people were evacuated from the main building of MSU.)
Annotation
• Each pair of sentences is annotated by at least 3 users
• Pairs annotated by fewer than 4 users with opposing judgements are cut off
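A minimal sketch of how such crowd judgements might be aggregated, assuming the cut-off rule reads "fewer than 4 annotators and conflicting labels"; the function name and thresholds are our own reading of the slide, not the project's code:

```python
from collections import Counter

def aggregate_judgements(labels, min_votes=3):
    """Majority vote over one sentence pair's crowd labels.

    Returns the winning class, or None if the pair should be cut off.
    Assumed rule: pairs annotated by fewer than 4 users whose
    judgements conflict are discarded.
    """
    if len(labels) < min_votes:
        return None  # not enough annotators yet
    counts = Counter(labels)
    if len(labels) < 4 and len(counts) > 1:
        return None  # conflicting judgements with too few votes: cut off
    return counts.most_common(1)[0][0]
```

For example, `aggregate_judgements(["precise", "loose", "precise"])` discards the pair (three votes, conflicting), while a fourth vote for "precise" would keep it.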
ParaPhraser
8072 sentence pairs, 3 classes:
◦ 1862 precise paraphrases (23%)
◦ 3257 loose paraphrases (40%)
◦ 2953 non-paraphrases (37%)
Linguistic characteristics
Evaluation

Method                                                  Result, %
Unsupervised similarity metric, 2 classes (Precision)   80.24
Supervised similarity metric, 3 classes (F1)            60.31
New supervised approach, 3 classes (F1)                 63.94
New supervised approach, 2 classes (F1)                 82.46
At second (I)
o Annotated via crowdsourcing: 3 paraphrase classes
o Dear X, please evaluate the similarity of meaning:
  ◦ In Transnistria, the Stabilization Fund has been created for the needs of the President and the KGB
  ◦ Transnistria has created the Stabilization Fund at the cost of the Russian gas
o We compare prediction models based on different sentence similarity measures
At second (II)
o We compare prediction models based on different sentence similarity measures:
  ◦ shallow (edit distance, longest common subsequence, BLEU, word/character-level overlap, etc.)
  ◦ semantic (dictionary-based; uses semantic resources like WordNet)
  ◦ distributional (distributional semantic models)
o We analyze the linguistic characteristics of the misclassified sentences
o We analyze the level of agreement between the annotators
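For illustration, two of the shallow measures named above (edit distance and word-level overlap) can be sketched as follows; these are textbook versions, not the project's exact feature implementations:

```python
def edit_distance(s1, s2):
    """Levenshtein distance between two strings (single-row DP)."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    row = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        prev, row[0] = row[0], i
        for j, c2 in enumerate(s2, start=1):
            cur = row[j]
            row[j] = min(row[j] + 1,         # deletion
                         row[j - 1] + 1,     # insertion
                         prev + (c1 != c2))  # substitution or match
            prev = cur
    return row[-1]

def word_overlap(s1, s2):
    """Jaccard overlap of the word sets of two sentences."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    return len(w1 & w2) / len(w1 | w2)
```

Shallow measures like these are cheap and language-independent, which is why they remain competitive with semantic and distributional features in the table below.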
Level of Agreement between the Annotators: Cohen’s Kappa
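Cohen's kappa for a pair of annotators can be computed as below; this is the standard textbook formula, shown as a sketch rather than the project's code:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance.

    Undefined (division by zero) when both annotators always assign
    the same single label.
    """
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # chance agreement: probability both annotators pick the same class
    p_expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)
```

Kappa of 1 means perfect agreement; 0 means agreement no better than chance, which is why it is a stricter measure than raw percentage agreement for a 3-class annotation task.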
Percentage of Different Linguistic Phenomena among the Most Disagreed-On Sentence Pairs (Top-100)

Linguistic Phenomenon   %
Different content       68
Presupposition          43
Syntactic synonymy      28
Quotation               22
Phrasal synonymy        20
Reordering              19
Numeral                 12
Synonymy                11
Context knowledge       11
Different time          8
Metonymy                6
Transliteration         3
Metaphor                2
Evaluation of Different Feature Sets

Feature set                                 Precision, %  Recall, %  F1-score, %
shallow                                     62.42         61.04      60.67
semantic                                    62.41         59.28      58.78
distrib                                     61.22         58.25      57.16
distrib + cosine                            60.63         57.69      56.54
distrib + cosine_ext                        61.02         58.17      57.22
shallow + semantic                          63.75         62.15      62.02
shallow + distrib                           63.49         61.83      61.61
shallow + distrib + cosine                  63.42         61.67      61.42
shallow + distrib + cosine_ext              64.04         62.39      62.19
shallow + semantic + distrib                63.72         62.23      62.04
shallow + semantic + distrib + cosine       64.05         62.87      62.68
shallow + semantic + distrib + cosine_ext   65.73         63.90      63.66
shallow + semantic + cosine                 63.82         62.55      62.31
shallow + semantic + cosine_ext             64.42         62.95      62.72
Class 1 (Precise Paraphrases)
Sentence pairs 1-3 show that all three feature types fail to detect pragmatic presupposition, as well as syntactic synonymy combined with word- and phrase-level synonymy and reordering.
In #1, see "convict" and "Supreme Court": only a convicted person can be subject to the cancelled sentence, and the court is the one expected to cancel it.
In #2, "Donbass" as the reference to the place of action in the first sentence is also obvious, especially for a Russian speaker, since the action concerns the martial law imposed in response to the attack by the militia.
In #3, the precise paraphrase class is disputable. For a naïve Russian speaker, "tourists" might be identical to "Russian tourists" in a news report due to presupposition, especially if it is a Russian news report; this is certainly not a general truth, but a prediction based on the general expectations of our group of annotators.
Class 0 (Loose Paraphrases, Conveying Similar Meaning)
Sentence pairs 4-7 show that all three feature types fail to detect presupposition, to understand context, or to understand communicative phrase structure.
In #6 there is a "difficult" sentence pair: understood metaphorically, the sentences might be considered somewhat similar; however, such understanding requires a large amount of general knowledge, and it is extremely hard to teach a machine to distinguish such subtle meanings.
Much of the difficulty concerns the choice between different variants of the communicative phrase structure, e.g.:
In Transnistria, the Stabilization Fund has been created / for the needs of the President and the KGB || Transnistria has created the Stabilization Fund / at the cost of the Russian gas.
Class 0 (Loose Paraphrases, Conveying Similar Meaning)
From the annotators' point of view, the main part of the information structure is marked in bold, and the unimportant part in italic.
This explains the difference between the "true" class (the annotators' result) and the three predicted classes.
The situation is similar in #8: "Transnistria has created the Stabilization Fund" and "In Transnistria, the Stabilization Fund has been created" are the most important parts of the phrase structure.
The same holds for #9, although its phrase structure is simpler.
What is more important: that Turkish police have detained three sons of ministers, or the reason (that they have been arrested for corruption)?
Some words at last
Our paraphrase corpus is collected from news headlines and therefore can be considered a summarized news stream describing the most important events.
By building a graph of paraphrases, we can detect such events.
We construct two types of graphs: one based on the current human annotation and one based on the complex model's predictions.
The structure of the graphs is compared and analyzed; the model graph has larger connected components, which give a more complete picture of the important events than the human annotation graph.
The predictive model appears to be better at capturing full information about the important events in the news collection than the human annotators.
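The event-detection step can be sketched with a small union-find over paraphrase edges; headlines that end up in the same connected component are taken to describe the same event. This is illustrative code with names of our own, not the project's implementation:

```python
def connected_components(headlines, paraphrase_pairs):
    """Group headlines into events.

    Nodes are headlines; an edge joins two headlines judged (by the
    annotators or by the model) to be paraphrases. Each connected
    component is read as one news event.
    """
    parent = {h: h for h in headlines}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in paraphrase_pairs:
        parent[find(a)] = find(b)  # union the two components

    components = {}
    for h in headlines:
        components.setdefault(find(h), []).append(h)
    return list(components.values())
```

Comparing the "annotators" and "model" graphs then reduces to comparing the groupings this function returns for the two edge sets.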
Corpus splitting
The corpus is divided into two parts:
◦ the graphs are constructed on the events since 2015
◦ the model is trained on the part referring to 2013-2014
◦ training and test sets are not chosen randomly (so that no possible chains of events are lost)
◦ we do not employ any other data (we need to compare the graphs)
Are the connected components corresponding to the same event really larger in the "model" graph?
Out of 50 connected components of the "model" graph:
◦ 42 (84%) are larger than the corresponding components of the "annotators" graph
◦ 23 (46%) correspond to 2 or more "annotators" components each
◦ 10 (20%) correspond to several "annotators" components each by mistake (false positives)
◦ 2 (4%) should have been combined into a single component, but are not (false negatives)
Top-5 connected components of the two graphs

Annotators | Model
Earthquake in Nepal | Earthquake in Nepal + avalanche on Everest (#2 in the annotators' graph) + a few sentences about other disasters
"Immortal regiment" march | "Immortal regiment" march
The space truck "Progress M-27M" | The space truck "Progress M-27M"
Evacuation from Nepal by a Russian aircraft | Evacuation from Nepal by a Russian aircraft
Elections in Kazakhstan | Elections in Kazakhstan
Results I
◦ Connected components ("cc") in the "model" graph are larger than the cc in the "annotators" graph
◦ The "model" cc are usually formed by joining several "annotators" cc referring to the same topic
◦ Central nodes often stay the same from graph to graph (the shortest and simplest sentences are likely to have the largest node degrees)
◦ In general, the "model" cc give a more complete picture of the described events
Results II
◦ Based on n-gram overlap, the "model" can join completely different events together (for example, the fire in Orel, the evacuation of people from a Mi-8, and the clashes in Peru in the 4th largest "annotators" connected component)
◦ Sometimes "model" components can miss some nodes; for example, the component with headlines about "Progress M-27M" lacks one node which is present in the "annotators" graph
Conclusion
◦ We have created a publicly available Russian sentential paraphrase corpus
◦ The corpus can be applied for multiple purposes
◦ The corpus is aimed at news with heterogeneous information structure
◦ The corpus can be considered a news stream describing the most important events occurring in the world
THANK YOU FOR YOUR ATTENTION
and join us at ParaPhraser.ru!