ParaPhraser: Russian Paraphrase Corpus and Shared Task
Elena Yagunova, Ekaterina Pronoza, et al.
Saint Petersburg State University & Co
ParaPhraser.ru, 2016
Our approach (overview)
As part of our project ParaPhraser on the identification and classification of Russian paraphrases, we have collected a corpus of more than 8000 sentence pairs annotated as precise, loose or non-paraphrases.
It is annotated via crowdsourcing by naïve native Russian speakers.
Our paraphrase corpus is collected from news headlines and therefore can be considered a summarized news stream describing the most important events.
At first
The aims of our paraphrase project:
◦ to create a publicly available Russian sentential paraphrase corpus
The corpus is NOT intended to be a general-purpose one
Potential applications: Information Extraction, Text Summarization
Sources of paraphrases
Natural:
◦ parallel multilingual corpora
◦ comparable monolingual corpora
◦ different translations of the same stories/novels
◦ news collections
◦ texts from social networks
Artificial
Other corpora
Microsoft Research Paraphrase Corpus (2004):
◦ 5801 sentence pairs, 67% paraphrases
◦ 2 classes: paraphrases and non-paraphrases
◦ annotated by 2 experts
Also: METER corpus, Knight and Marcu corpus, User Language Paraphrase Corpus, PPDB, SEMILAR, Twitter Paraphrase Corpus, etc.
Our unsupervised approach
◦ Parse news articles in real time
◦ Calculate the similarity metric for each pair of headlines published by different media agencies within a short period of time
◦ Include candidate pairs in the corpus
◦ Annotate candidate pairs (using crowdsourcing)
Unsupervised Similarity Metric
Extends matrix metric by Fernando and Stevenson:
The word similarity matrix W is calculated according to the rules:
◦ identical words starting with a capital letter -> 1.2
◦ identical words -> 1
◦ synonyms -> NPMI, Dice or Jaccard synset coefficient multiplied by 0.8
◦ substrings -> the length of the smaller word divided by the length of the larger word, multiplied by 0.7
◦ common prefix (at least 3 characters) -> the prefix length divided by the length of the smaller word, multiplied by 0
◦ otherwise -> 0
sim(a, b) = (a · W · bᵀ) / (|a| · |b|)
where a and b are the word vectors of the two sentences and W is the word similarity matrix.
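As a rough illustration, the scoring rules and the matrix metric can be sketched in Python. This is only a sketch of the scheme described above, not the project's implementation: the synonym coefficient is assumed to be precomputed externally, and `prefix_weight` is a placeholder value, since the multiplier for the common-prefix rule is truncated on the slide.

```python
import numpy as np

def word_pair_score(w1, w2, synonym_score=None, prefix_weight=0.6):
    """Score one word pair per the rules above (a sketch).

    synonym_score: a precomputed NPMI/Dice/Jaccard synset coefficient,
    or None if the words are not known synonyms.
    prefix_weight: placeholder -- the slide truncates this multiplier.
    """
    if w1 == w2:
        # identical words; capitalized ones (often named entities) score higher
        return 1.2 if w1[0].isupper() else 1.0
    if synonym_score is not None:
        return 0.8 * synonym_score
    a, b = w1.lower(), w2.lower()
    small, large = sorted((a, b), key=len)
    if small in large:
        # one word is a substring of the other
        return 0.7 * len(small) / len(large)
    prefix = 0
    for c1, c2 in zip(a, b):
        if c1 != c2:
            break
        prefix += 1
    if prefix >= 3:
        return prefix_weight * prefix / len(small)
    return 0.0

def sentence_similarity(words_a, words_b):
    """Matrix metric sim(a, b) = a·W·bT / (|a|·|b|) over two token lists."""
    W = np.array([[word_pair_score(w1, w2) for w2 in words_b]
                  for w1 in words_a])
    # simplified: binary occurrence vectors over each sentence's own tokens
    a = np.ones(len(words_a))
    b = np.ones(len(words_b))
    return float(a @ W @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
```

Note that identical capitalized words can push the score of a sentence pair above 1, which is intended: matching named entities are strong paraphrase evidence in news headlines.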
Paraphrase classes
Precise paraphrase:
◦ КНДР аннулировала договор о ненападении с Южной Кореей.
  (DPRK annulled the non-aggression treaty with South Korea.)
◦ КНДР вышла из соглашений о ненападении с Южной Кореей.
  (DPRK withdrew from the non-aggression agreement with South Korea.)
Loose paraphrase:
◦ ВТБ может продать долю в Tele2 в ближайшие недели.
  (VTB might sell its stake in Tele2 in the coming weeks.)
◦ ВТБ анонсировал продажу Tele2.
  (VTB announced the sale of Tele2.)
Paraphrase classes
Non-paraphrase:
◦ В главном здании МГУ загорелась столовая.
  (The canteen caught fire in the main building of MSU.)
◦ Из главного здания МГУ эвакуированы около 300 человек.
  (About 300 people were evacuated from the main building of MSU.)
Annotation
• Each pair of sentences is annotated by at least 3 users
• Pairs annotated by fewer than 4 users with opposing judgements are cut off
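A minimal sketch of how such crowd judgements might be aggregated, assuming the cut-off rule reads "fewer than 4 annotators and conflicting labels"; the function name and thresholds are our own reading of the slide, not the project's code:

```python
from collections import Counter

def aggregate_judgements(labels, min_votes=3):
    """Majority vote over one sentence pair's crowd labels.

    Returns the winning class, or None if the pair should be cut off.
    Assumed rule: pairs annotated by fewer than 4 users whose
    judgements conflict are discarded.
    """
    if len(labels) < min_votes:
        return None  # not enough annotators yet
    counts = Counter(labels)
    if len(labels) < 4 and len(counts) > 1:
        return None  # conflicting judgements with too few votes: cut off
    return counts.most_common(1)[0][0]
```

For example, `aggregate_judgements(["precise", "loose", "precise"])` discards the pair (three votes, conflicting), while a fourth vote for "precise" would keep it.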
ParaPhraser
8072 sentence pairs, 3 classes:
◦ 1862 precise paraphrases (23%)
◦ 3257 loose paraphrases (40%)
◦ 2953 non-paraphrases (37%)
Linguistic characteristics
Evaluation

Method                                                  Result, %
Unsupervised similarity metric, 2 classes (Precision)   80.24
Supervised similarity metric, 3 classes (F1)            60.31
New supervised approach, 3 classes (F1)                 63.94
New supervised approach, 2 classes (F1)                 82.46
At second (I)
o Annotated via crowdsourcing: 3 paraphrase classes
o Dear X, please evaluate the similarity of meaning:
  ◦ In Transnistria, the Stabilization Fund has been created for the needs of the President and the KGB
  ◦ Transnistria has created the Stabilization Fund at the cost of the Russian gas
o We compare prediction models based on different sentence similarity measures
At second (II)
o We compare prediction models based on different sentence similarity measures:
  ◦ shallow (edit distance, longest common subsequence, BLEU, word/character-level overlap, etc.)
  ◦ semantic (dictionary-based; uses semantic resources like WordNet)
  ◦ distributional (distributional semantic models)
o We analyze the linguistic characteristics of the misclassified sentences
o We analyze the level of agreement between the annotators
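For illustration, two of the shallow measures named above (edit distance and word-level overlap) can be sketched as follows; these are textbook versions, not the project's exact feature implementations:

```python
def edit_distance(s1, s2):
    """Levenshtein distance between two strings (single-row DP)."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    row = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        prev, row[0] = row[0], i
        for j, c2 in enumerate(s2, start=1):
            cur = row[j]
            row[j] = min(row[j] + 1,         # deletion
                         row[j - 1] + 1,     # insertion
                         prev + (c1 != c2))  # substitution or match
            prev = cur
    return row[-1]

def word_overlap(s1, s2):
    """Jaccard overlap of the word sets of two sentences."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    return len(w1 & w2) / len(w1 | w2)
```

Shallow measures like these are cheap and language-independent, which is why they remain competitive with semantic and distributional features in the table below.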
Level of Agreement between the Annotators: Cohen’s Kappa
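Cohen's kappa for a pair of annotators can be computed as below; this is the standard textbook formula, shown as a sketch rather than the project's code:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance.

    Undefined (division by zero) when both annotators always assign
    the same single label.
    """
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # chance agreement: probability both annotators pick the same class
    p_expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)
```

Kappa of 1 means perfect agreement; 0 means agreement no better than chance, which is why it is a stricter measure than raw percentage agreement for a 3-class annotation task.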
Percentage of Different Linguistic Phenomena among the Most Disagreed-On Sentence Pairs (Top-100)

Linguistic Phenomenon   %
Different content       68
Presupposition          43
Syntactic synonymy      28
Quotation               22
Phrasal synonymy        20
Reordering              19
Numeral                 12
Synonymy                11
Context knowledge       11
Different time          8
Metonymy                6
Transliteration         3
Metaphor                2
Evaluation of Different Feature Sets

Feature set                                 Precision, %  Recall, %  F1-score, %
shallow                                     62.42         61.04      60.67
semantic                                    62.41         59.28      58.78
distrib                                     61.22         58.25      57.16
distrib + cosine                            60.63         57.69      56.54
distrib + cosine_ext                        61.02         58.17      57.22
shallow + semantic                          63.75         62.15      62.02
shallow + distrib                           63.49         61.83      61.61
shallow + distrib + cosine                  63.42         61.67      61.42
shallow + distrib + cosine_ext              64.04         62.39      62.19
shallow + semantic + distrib                63.72         62.23      62.04
shallow + semantic + distrib + cosine       64.05         62.87      62.68
shallow + semantic + distrib + cosine_ext   65.73         63.90      63.66
shallow + semantic + cosine                 63.82         62.55      62.31
shallow + semantic + cosine_ext             64.42         62.95      62.72
Class 1 (Precise Paraphrases)
Sentence pairs 1-3 show that all three feature types fail to detect pragmatic presupposition, as well as syntactic synonymy combined with word- and phrase-level synonymy and reordering.
In #1, see "convict" and "Supreme Court": only a convicted person can be subject to the cancelled sentence, and the court is the one expected to cancel it.
In #2, "Donbass" as the reference to the place of action in the first sentence is also obvious, especially for a Russian speaker, since the action concerns the martial law imposed in response to the attack by the militia.
In #3, the precise paraphrase class is disputable. For a naïve Russian speaker, "tourists" might be identical to "Russian tourists" in a news report due to presupposition, especially if it is a Russian news report; this is certainly not a general truth, but a prediction based on the general expectations of our group of annotators.
Class 0 (Loose Paraphrases, Conveying Similar Meaning)
Sentence pairs 4-7 show that all three feature types fail to detect presupposition, to understand context, or to understand communicative phrase structure.
In #6 there is a "difficult" sentence pair: understood metaphorically, the sentences might be considered somewhat similar; however, such understanding requires a large amount of general knowledge, and it is extremely hard to teach a machine to distinguish such subtle meanings.
Much of the difficulty concerns the choice between different variants of the communicative phrase structure, e.g.:
In Transnistria, the Stabilization Fund has been created / for the needs of the President and the KGB || Transnistria has created the Stabilization Fund / at the cost of the Russian gas.
Class 0 (Loose Paraphrases, Conveying Similar Meaning)
From the annotators' point of view, the main part of the information structure is marked in bold, and the unimportant part in italic.
This explains the difference between the "true" class (the annotators' result) and the three predicted classes.
The situation is similar in #8: "Transnistria has created the Stabilization Fund" and "In Transnistria, the Stabilization Fund has been created" are the most important parts of the phrase structure.
The same holds for #9, although its phrase structure is simpler.
What is more important: that Turkish police have detained three sons of ministers, or the reason (that they have been arrested for corruption)?
Some words at last
Our paraphrase corpus is collected from news headlines and therefore can be considered a summarized news stream describing the most important events.
By building a graph of paraphrases, we can detect such events.
We construct two types of graphs: one based on the current human annotation and one based on the complex model's predictions.
The structure of the graphs is compared and analyzed; the model graph has larger connected components, which give a more complete picture of the important events than the human annotation graph.
The predictive model appears to be better at capturing full information about the important events in the news collection than the human annotators.
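The event-detection step can be sketched with a small union-find over paraphrase edges; headlines that end up in the same connected component are taken to describe the same event. This is illustrative code with names of our own, not the project's implementation:

```python
def connected_components(headlines, paraphrase_pairs):
    """Group headlines into events.

    Nodes are headlines; an edge joins two headlines judged (by the
    annotators or by the model) to be paraphrases. Each connected
    component is read as one news event.
    """
    parent = {h: h for h in headlines}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in paraphrase_pairs:
        parent[find(a)] = find(b)  # union the two components

    components = {}
    for h in headlines:
        components.setdefault(find(h), []).append(h)
    return list(components.values())
```

Comparing the "annotators" and "model" graphs then reduces to comparing the groupings this function returns for the two edge sets.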
Corpus splitting
The corpus is divided into two parts:
◦ the graphs are constructed on the events since 2015
◦ the model is trained on the part referring to 2013-2014
◦ training and test sets are not chosen randomly (so that no possible chains of events are lost)
◦ we do not employ any other data (we need to compare the graphs)
Are the connected components corresponding to the same event really larger in the "model" graph?
Out of 50 connected components of the "model" graph:
◦ 42 (84%) are larger than the corresponding components of the "annotators" graph
◦ 23 (46%) correspond to 2 or more "annotators" components each
◦ 10 (20%) correspond to several "annotators" components each by mistake (false positives)
◦ 2 (4%) should have been combined into a single component, but are not (false negatives)
Top-5 connected components of the two graphs

Annotators | Model
Earthquake in Nepal | Earthquake in Nepal + avalanche on Everest (#2 in the annotators' graph) + a few sentences about other disasters
"Immortal regiment" march | "Immortal regiment" march
The space truck "Progress M-27M" | The space truck "Progress M-27M"
Evacuation from Nepal by a Russian aircraft | Evacuation from Nepal by a Russian aircraft
Elections in Kazakhstan | Elections in Kazakhstan
Results I
◦ Connected components ("cc") in the "model" graph are larger than the cc in the "annotators" graph
◦ The "model" cc are usually formed by joining several "annotators" cc referring to the same topic
◦ Central nodes often stay the same from graph to graph (the shortest and simplest sentences are likely to have the largest node degrees)
◦ In general, the "model" cc give a more complete picture of the described events
Results II
◦ Based on n-gram overlap, the "model" can join completely different events together (for example, the fire in Orel, the evacuation of people from a Mi-8, and the clashes in Peru in the 4th largest "annotators" connected component)
◦ Sometimes "model" components can miss some nodes; for example, the component with headlines about "Progress M-27M" lacks one node which is present in the "annotators" graph
Conclusion
◦ We have created a publicly available Russian sentential paraphrase corpus
◦ The corpus can be applied for multiple purposes
◦ The corpus is aimed at news with heterogeneous information structure
◦ The corpus can be considered a news stream describing the most important events occurring in the world
THANK YOU FOR YOUR ATTENTION
and join us at ParaPhraser.ru!