
How do Order and Proximity Impact the Readability of Event Summaries?

Klaus Berberich ([email protected])

Arunav Mishra ([email protected])


Motivation

๏ Our own earlier work (SIGIR 2016) proposed an extractive summarization method to generate digests of news events and relied on ROUGE-N for its evaluation

๏ ROUGE-N, an established framework for measuring the quality of an automatically generated text summary relative to a human-generated gold-standard summary, is not sensitive to sentence order


ROUGE-N

๏ ROUGE-N considers N-grams and measures the precision, recall, and F1-measure of the automatically generated summary relative to the gold-standard summary

ECIR is the premier European forum for the presentation of new research in the field of Information Retrieval. Robert Gordon University is very pleased to host this important event. We are delighted to announce that ECIR 2017 (European Conference on Information Retrieval) will be returning to Aberdeen, Scotland for the first time since 1997.

Summary 1: ECIR is a European forum focused on Information Retrieval. In 2017 it will take place in Aberdeen, Scotland. This is the first time that the conference returns to Aberdeen since 1997.

Summary 2: This is the first time that the conference returns to Aberdeen since 1997. In 2017 it will take place in Aberdeen, Scotland. ECIR is a European forum focused on Information Retrieval.

Same ROUGE-N scores!
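To make the score's insensitivity to order concrete, here is a minimal ROUGE-N sketch in Python. It is not the official ROUGE toolkit: tokenization is a plain whitespace split, and the reference below is an abridged, hypothetical gold standard. Because n-grams are counted within sentences and then pooled, reordering a candidate's sentences leaves its scores unchanged.

```python
# Minimal ROUGE-N sketch (not the official ROUGE toolkit): n-gram precision,
# recall, and F1 of a candidate summary against a gold-standard summary.
from collections import Counter

def sentence_ngrams(sentences, n):
    grams = Counter()
    for sent in sentences:
        toks = sent.lower().split()
        grams.update(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return grams

def rouge_n(candidate_sents, reference_sents, n=2):
    cand = sentence_ngrams(candidate_sents, n)
    ref = sentence_ngrams(reference_sents, n)
    overlap = sum((cand & ref).values())             # clipped n-gram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return precision, recall, f1

# Abridged, hypothetical reference for illustration only.
reference = ["ECIR 2017 will be returning to Aberdeen, Scotland for the first time since 1997."]
summary_1 = ["ECIR is a European forum focused on Information Retrieval.",
             "In 2017 it will take place in Aberdeen, Scotland.",
             "This is the first time that the conference returns to Aberdeen since 1997."]
summary_2 = list(reversed(summary_1))                # same sentences, different order

# The pooled n-gram multisets are identical, so both candidates get identical scores.
print(rouge_n(summary_1, reference), rouge_n(summary_2, reference))
```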


Motivation

๏ Do sentence order and sentence proximity affect the readability of an automatically generated text summary as perceived by a human user?

๏ Conduct an empirical study on CrowdFlower to find out

๏ collect a dataset that can be used to evaluate novel quality measures for text summarization aware of sentence order

๏ make the dataset publicly available for others to use


Outline

๏ Motivation

๏ Research Questions

๏ Crowdsourcing Task

๏ Results

๏ Summary


Research Questions

๏ RQ1: What is the impact of summary structure, in terms of sentence order, on the readability of summaries for past events?

๏ RQ2: Does changing the proximity between sentences in a coherent human-written summary affect its readability?

๏ RQ3: How feasible is it to evaluate the structure of a fixed-length summary for a past event through crowdsourcing?


Outline

๏ Motivation

๏ Research Questions

๏ Crowdsourcing Job

๏ Results

๏ Summary


Crowdsourcing Job

๏ Where can we obtain human-generated event summaries that can serve as our gold-standard summaries?

๏ How can we systematically modify the obtained gold-standard summaries to answer the first two of our research questions?

๏ What measures can be taken to ensure that the collected preference judgments are of high quality?

๏ CrowdFlower is used as the crowdsourcing platform due to its availability in Europe and its relative ease of use


Identifying Seminal Events

๏ Wikipedia’s “Timeline of modern history” as a source of seminal events with news coverage

๏ 100 events from the period 1987-2007 are selected randomly (see the sketch after this list)

๏ How to obtain a ground-truth summary for each of them?
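A minimal sketch of the random selection step, assuming the timeline has already been scraped into (year, description) pairs; that representation and the fixed seed are illustrative assumptions, not the authors' actual tooling.

```python
import random

def sample_events(timeline_events, start_year=1987, end_year=2007, k=100, seed=0):
    # timeline_events: hypothetical list of (year, description) pairs taken from
    # Wikipedia's "Timeline of modern history"; draw k events from the period.
    in_period = [e for e in timeline_events if start_year <= e[0] <= end_year]
    return random.Random(seed).sample(in_period, k)   # raises if fewer than k events
```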


Obtaining Ground-Truth Summaries

๏ Wikipedia article with full details on the event is identified manually


Obtaining Ground-Truth Summaries

๏ 10 sentences at the beginning of the lead section are selected as a human-written, neutral, linguistically simple summary

๏ Sentences are converted to lower case and co-references are replaced by their corresponding referent (a minimal sketch follows the example below)

[1] the 2002 gujarat riots, also known as the 2002 gujarat violence and the gujarat pogrom, was a three-day period of inter-communal violence in the western indian state of gujarat. [2] following the initial incident there were further outbreaks of violence in ahmedabad for three weeks ; statewide, there were further outbreaks of communal riots against the minority muslim population for three months. [3] the burning of a train in godhra on 27 february 2002, which caused the deaths of 58 hindu pilgrims karsevaks returning from ayodhya, is believed to have triggered the violence. [4] according to official figures, the communal riots resulted in the deaths of 790 muslims and 254 hindus ; 2,500 people were injured non-fatally, and 223 more were reported missing. [5] other sources estimate that up to 2,500 muslims died …
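A minimal sketch of this preparation step, assuming NLTK's Punkt sentence tokenizer is acceptable for sentence splitting; the coreference replacement mentioned above is not reproduced here and would require a separate coreference resolver or manual editing.

```python
# Sketch of the ground-truth preparation: take the first k sentences of the
# Wikipedia lead section and lowercase them. Requires: nltk.download("punkt").
import nltk

def ground_truth_summary(lead_section, k=10):
    sentences = nltk.sent_tokenize(lead_section)[:k]   # first k sentences of the lead
    return [s.lower() for s in sentences]               # lowercased, as in the study

lead = ("The 2002 Gujarat riots, also known as the 2002 Gujarat violence and the Gujarat "
        "pogrom, was a three-day period of inter-communal violence in the western Indian "
        "state of Gujarat. Following the initial incident there were further outbreaks of "
        "violence in Ahmedabad for three weeks.")
print(ground_truth_summary(lead, k=2))
```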


Original Summary (O)

[1] the 2002 gujarat riots, also known as the 2002 gujarat violence and the gujarat pogrom, was a three-day period of inter-communal violence in the western indian state of gujarat. [2] following the initial incident there were further outbreaks of violence in ahmedabad for three weeks; statewide, there were further outbreaks of communal riots against the minority muslim population for three months. [3] the burning of a train in godhra on 27 february 2002, which caused the deaths of 58 hindu pilgrims karsevaks returning from ayodhya, is believed to have triggered the violence. [4] according to official figures, the communal riots resulted in the deaths of 790 muslims and 254 hindus ; 2,500 people were injured non-fatally, and 223 more were reported missing. [5] other sources estimate that up to 2,500 muslims died. [6] there were instances of rape, children being burned alive, and widespread looting and destruction of property. [7] the chief minister at that time, narendra modi, has been accused of initiating and condoning the violence, as have police and government officials who allegedly directed the rioters and gave lists of muslim-owned properties to them. [8] in 2012, narendra modi was cleared of complicity in the violence by a special investigation team (sit) appointed by the supreme court of india. [9] the sit also rejected claims that the state government had not done enough to prevent the communal riots. [10] while officially classified as a communalist riot, the events of 2002 have been described as a pogrom by many scholars, with some commentators alleging that the attacks had been planned, were well orchestrated, and that the attack on the train in godhra on 27 february 2002 was a ”staged trigger” for what was actually premeditated violence.


Shuffled Summary (S)

๏ First sentence is always kept in place

[1] the 2002 gujarat riots, also known as the 2002 gujarat violence and the gujarat pogrom, was a three-day period of inter-communal violence in the western indian state of gujarat. [6] there were instances of rape, children being burned alive, and widespread looting and destruction of property. [2] following the initial incident there were further outbreaks of violence in ahmedabad for three weeks ; statewide, there were further outbreaks of communal riots against the minority muslim population for three months. [10] while officially classified as a communalist riot, the events of 2002 have been described as a pogrom by many scholars, with some commentators alleging that the attacks had been planned, were well orchestrated, and that the attack on the train in godhra on 27 february 2002 was a ”staged trigger” for what was actually premeditated violence. [9] the sit also rejected claims that the state government had not done enough to prevent the communal riots. [8] in 2012, narendra modi was cleared of complicity in the violence by a special investigation team ( sit ) appointed by the supreme court of india. [3] the burning of a train in godhra on 27 february 2002, which caused the deaths of 58 hindu pilgrims karsevaks returning from ayodhya, is believed to have triggered the violence. [7] the chief minister at that time, narendra modi, has been accused of initiating and condoning the violence, as have police and government officials who allegedly directed the rioters and gave lists of muslim-owned properties to them. [4] according to official figures, the communal riots resulted in the deaths of 790 muslims and 254 hindus ; 2,500 people were injured non-fatally, and 223 more were reported missing. [5] other sources estimate that up to 2,500 muslims died.

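The exact shuffling procedure is not spelled out beyond keeping the first sentence fixed, so the following is a plausible sketch of how the shuffled (S) and reversed (R) variants can be derived from the original sentence list; the fixed seed is only for reproducibility.

```python
import random

def shuffled_variant(sentences, seed=0):
    # Set S: keep the first sentence in place, shuffle the remaining ones.
    rng = random.Random(seed)
    rest = list(sentences[1:])
    rng.shuffle(rest)
    return [sentences[0]] + rest

def reversed_variant(sentences):
    # Set R: keep the first sentence in place, reverse the remaining ones.
    return [sentences[0]] + list(reversed(sentences[1:]))

original = ["s1", "s2", "s3", "s4", "s5"]   # placeholder sentences
print(shuffled_variant(original), reversed_variant(original))
```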


Task Design

๏ CrowdFlower used as a crowdsourcing platform to collect preference judgments regarding the readability of two event summaries

๏ Instructions in simple language provide contributors with

๏ Overview – What is the task about?

๏ Help – What is meant by readability and coherence?

๏ Process – How to proceed when working on the task?

๏ Pro Tips – How to avoid common mistakes?


Instructions


User Interface

Screenshot of the task interface, annotated with its components: Event Description, Summaries, Preference, Reason, Difficulty


Quality Control

๏ Potential contributor pool is restricted by

๏ allowing only those flagged as “high quality” by CrowdFlower

๏ requiring that their primary language is set as English

๏ Qualification quiz based on 10% of units judged by the authors, in which contributors have to achieve at least 70% accuracy

๏ Ongoing quality assessment based on the same 10% of units, for which contributors need to stay above 70% accuracy

๏ Traps let contributors compare identical summaries


Outline

๏ Motivation

๏ Research Questions

๏ Crowdsourcing Task

๏ Results

๏ Summary


Results

๏ CrowdFlower job ran between Sep 28th and Oct 1st, 2016

๏ 700 summary pairs in total, out of which 100 are same-summary pairs, which serve as traps for quality control

๏ Tasks shown to contributors consist of five summary pairs, which need to be judged independently of each other

๏ $0.012 paid to contributors per successfully completed task

๏ $83.47 total cost for running the job


Experiment 1

๏ RQ1: What is the impact of summary structure in terms of sentence order on the readability of summaries for past events?

๏ Contributors are asked to compare the original summary against a reversed summary that reverses the order of sentences

๏ For the sake of control, contributors also compare the original and reversed summaries against a shuffled summary

๏ Each pair of summaries is shown to three contributors and the final preference label is obtained via majority vote


Reversed Summary (R)

๏ First sentence is always kept in place

[1] the 2002 gujarat riots, also known as the 2002 gujarat violence and the gujarat pogrom, was a three-day period of inter-communal violence in the western indian state of gujarat. [10] while officially classified as a communalist riot, the events of 2002 have been described as a pogrom by many scholars, with some commentators alleging that the attacks had been planned, were well orchestrated, and that the attack on the train in godhra on 27 february 2002 was a ”staged trigger” for what was actually premeditated violence. [9] the sit also rejected claims that the state government had not done enough to prevent the communal riots. [8] in 2012, narendra modi was cleared of complicity in the violence by a special investigation team ( sit ) appointed by the supreme court of india. [7] the chief minister at that time, narendra modi, has been accused of initiating and condoning the violence, as have police and government officials who allegedly directed the rioters and gave lists of muslim-owned properties to them. [6] there were instances of rape, children being burned alive, and widespread looting and destruction of property. [5] other sources estimate that up to 2,500 muslims died. [4] according to official figures, the communal riots resulted in the deaths of 790 muslims and 254 hindus ; 2,500 people were injured non-fatally, and 223 more were reported missing. [3] the burning of a train in godhra on 27 february 2002, which caused the deaths of 58 hindu pilgrims karsevaks returning from ayodhya, is believed to have triggered the violence. [2] following the initial incident there were further outbreaks of violence in ahmedabad for three weeks ; statewide, there were further outbreaks of communal riots against the minority muslim population for three months.



Results 1

Preference counts over the 100 units of each comparison (final label per unit via majority vote over three judgments):

๏ O vs R: original preferred in 82 units, reversed in 18

๏ O vs S: original preferred in 93 units, shuffled in 7

๏ R vs S: reversed preferred in 43 units, shuffled in 57

Fleiss’ 𝜅: 0.42 (O vs R), 0.47 (O vs S), 0.19 (R vs S)

Sentence order does matter!
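For reference, a small sketch of the aggregation reported here: the final label per unit is the majority over its three judgments, and inter-rater agreement is Fleiss' κ computed from the per-unit vote counts. The input format (lists of "A"/"B" preferences per unit) is an assumption for illustration.

```python
from collections import Counter

def majority_label(votes):
    # Final preference for a unit: the label chosen by most of its contributors.
    return Counter(votes).most_common(1)[0][0]

def fleiss_kappa(units, categories=("A", "B")):
    """units: one list of votes per unit, e.g. [["A", "A", "B"], ["B", "B", "B"], ...]."""
    n = len(units[0])                                   # raters per unit (3 in the study)
    N = len(units)
    counts = [[votes.count(c) for c in categories] for votes in units]
    p_j = [sum(row[j] for row in counts) / (N * n) for j in range(len(categories))]
    P_i = [(sum(x * x for x in row) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P_i) / N
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)

units = [["A", "A", "B"], ["A", "A", "A"], ["B", "A", "B"], ["A", "B", "A"]]
print([majority_label(v) for v in units], round(fleiss_kappa(units), 2))
```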


Experiment 2

๏ RQ2: Does changing the proximity between sentences in a coherent human-written summary affect its readability?

๏ Contributors are asked to compare the original summary against a proximity-minimizing summary

๏ For the sake of control, contributors also compare the original and proximity-minimizing summaries against a shuffled summary

๏ Each pair of summaries is shown to three contributors and the final preference label is obtained via majority vote


Proximity-Minimizing Summary (P)

๏ First sentence is always kept in place

[1] the 2002 gujarat riots, also known as the 2002 gujarat violence and the gujarat pogrom, was a three-day period of inter-communal violence in the western indian state of gujarat. [6] there were instances of rape, children being burned alive, and widespread looting and destruction of property. [5] other sources estimate that up to 2,500 muslims died. [8] in 2012, narendra modi was cleared of complicity in the violence by a special investigation team ( sit ) appointed by the supreme court of india. [3] the burning of a train in godhra on 27 february 2002, which caused the deaths of 58 hindu pilgrims karsevaks returning from ayodhya, is believed to have triggered the violence. [10] while officially classified as a communalist riot, the events of 2002 have been described as a pogrom by many scholars, with some commentators alleging that the attacks had been planned, were well orchestrated, and that the attack on the train in godhra on 27 february 2002 was a ”staged trigger” for what was actually premeditated violence. [2] following the initial incident there were further outbreaks of violence in ahmedabad for three weeks ; statewide, there were further outbreaks of communal riots against the minority muslim population for three months. [9] the sit also rejected claims that the state government had not done enough to prevent the communal riots. [4] according to official figures, the communal riots resulted in the deaths of 790 muslims and 254 hindus ; 2,500 people were injured non-fatally, and 223 more were reported missing. [7] the chief minister at that time, narendra modi, has been accused of initiating and condoning the violence, as have police and government officials who allegedly directed the rioters and gave lists of muslim-owned properties to them.

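The slides do not detail how the proximity-minimizing order (P) is constructed, so the sketch below is only one plausible reading: keep the first sentence fixed and, among random permutations of the rest, pick the one that pushes originally adjacent sentences as far apart as possible (i.e., minimizes their proximity). Treat both the objective and the random-search strategy as assumptions.

```python
import random

def proximity_score(order):
    # Sum of index distances, in the new order, between sentences that were
    # adjacent in the original summary (larger score = lower proximity).
    pos = {orig_idx: new_idx for new_idx, orig_idx in enumerate(order)}
    return sum(abs(pos[i] - pos[i + 1]) for i in range(len(order) - 1))

def proximity_minimizing_variant(sentences, trials=20000, seed=0):
    # Set P (assumed construction): keep sentence 0 in place and search random
    # permutations of the rest for the one with the largest proximity score.
    rng = random.Random(seed)
    n = len(sentences)
    best_order, best_score = None, -1
    for _ in range(trials):
        order = [0] + rng.sample(range(1, n), n - 1)
        score = proximity_score(order)
        if score > best_score:
            best_order, best_score = order, score
    return [sentences[i] for i in best_order]

original = [f"s{i}" for i in range(1, 11)]   # placeholder 10-sentence summary
print(proximity_minimizing_variant(original))
```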


Results 2

Preference counts over the 100 units of each comparison:

๏ P vs O: original preferred in 89 units, proximity-minimizing in 11

๏ P vs S: shuffled preferred in 59 units, proximity-minimizing in 41

๏ P vs R: proximity-minimizing preferred in 51 units, reversed in 49

Fleiss’ 𝜅: 0.50 (P vs O), 0.21 (P vs S), 0.27 (P vs R)

Sentence proximity does matter!


Experiment 3

๏ RQ3: How feasible is it to evaluate the structure of a fixed-length summary for a past event through crowdsourcing?

๏ Contributors are asked to specify a difficulty level (1–5) for each comparison of two summaries

๏ Contributors may provide free-text comments as a justification of their assessment

๏ CrowdFlower asks contributors to judge aspects of the task (e.g., interestingness and difficulty) and precisely records all of their actions


Results 3: Difficulty

Number of judgments per difficulty level for each comparison:

๏ O vs R: very easy 12, easy 117, not so easy 128, difficult 38, very difficult 5

๏ O vs P: very easy 12, easy 127, not so easy 132, difficult 22, very difficult 7

๏ O vs S: very easy 15, easy 129, not so easy 122, difficult 25, very difficult 9

๏ P vs S: very easy 6, easy 83, not so easy 159, difficult 36, very difficult 15

๏ P vs R: very easy 8, easy 92, not so easy 148, difficult 42, very difficult 10

๏ R vs S: very easy 6, easy 108, not so easy 132, difficult 35, very difficult 18

Summary pairs involving the original summary are easier

Results 3: Time

๏ CrowdFlower job took 77 hours to complete

Contributor funnel statistics (Fig. 9c):

๏ Quiz mode: 55 passed, 79 failed

๏ Work mode: 32 passed, 23 failed

๏ Trusted judgments: 2099; untrusted judgments: 414

Inter Quantile Mean (IQM) temporal statistics (Fig. 9d):

๏ Job run: 77 hours

๏ IQM trusted judgment: 1m 41s; IQM untrusted judgment: 1m 11s

๏ IQM task by trusted contributors: 8m 27s; IQM task by untrusted contributors: 5m 58s

Contributor satisfaction, out of 5 (Fig. 9e):

๏ Overall 2.9, instructions clear 2.7, ease of job 2.6, pay 2.6, test questions fair 2.8

Reason text length in words (Fig. 9f):

๏ 5: 1529; 6 to 10: 569; 11 to 15: 112; 16 to 20: 38; more than 20: 29

Notes from the embedded paper excerpt (Experiment 3: Feasibility of using CrowdFlower):

๏ Difficulty is collected in the user interface as a set of five radio buttons; CrowdFlower additionally surveys contributors on the interestingness of the job, the rate of progress, and their satisfaction.

๏ Over all judgments, 13% of the units are marked as very easy, 33% as easy, 39% as not so easy, 10% as difficult, and only 3.9% as very difficult.

๏ Fig. 9b plots the judgments acquired per hour over the 77-hour run (time zone GMT+2).

๏ A single trusted judgment took about 1 minute and 41 seconds on average; an untrusted judgment takes a comparable 1 minute and 11 seconds.

๏ In Experiment 2, contributors found more of the randomly shuffled summaries to be more coherent than the proximity-altered ones; on closer examination, some randomly shuffled summaries retain the proximity of a few sentences by chance, and these are the ones marked as more coherent.

Takeaways:

๏ Qualification quiz is effective (58% banned)

๏ Trusted judgments take more time

๏ Contributors are by and large happy with the job


Results 3: Contributor Comments

Hand-picked examples of Reason Box text (Table 2 of the paper; unit ids link to the released data; Summary A and Summary B are drawn from the indicated sets, and the reasons are contributors’ verbatim comments):

๏ Unit 1052032164, A = O, B = S, preference O: “The story makes sense. First, talk about the date, then about the consequences of the attacks and finish about the attack itself. B jumps from one issue to other, the link does not make sense sometimes.”

๏ Unit 1052262816, A = O, B = R, preference O: “The summary A describes correctly the order in which the ministers of economy were named and replaced, while summary B is talking about what the third minister of economy made without referring to his predecessors.”

๏ Unit 1051055147, A = S, B = R, preference R: “Again, difficult to choose but B starts with the cut of the power and finish with the restored, explaining the story in between.”

๏ Unit 1052081752, A = P, B = O, preference O: “the order makes sense. It starts with the flight, the number of passengers and then talk about the flight. In text A, the author jump from the flight to the pilot to come back to the airplane to come back to the pilot.”

๏ Unit 1051055286, A = R, B = P, preference P: “The two summaries have all paragraphs in wrong order. Both describes the causes of the accident at the end of the text when it should be at the beginning and both of them speaks about the doubts on the number of casualties before saying the official report of such amount.”

๏ Unit 1051054921, A = S, B = P, preference S: “The ‘tower commission’ is the main element around which everything revolves around in these summaries. In the Summary A, any sentence about ‘tower commission’ is near the other about it so that’s why I chose it.”

From the accompanying paper excerpt (Experiment 1): across the 300 units under consideration, the longest Reason Box text has 54 words, the shortest is a single word, and the average length is 5.6 words.


Outline

๏ Motivation

๏ Research Questions

๏ Crowdsourcing Task

๏ Results

๏ Summary


Summary

๏ Empirical study conducted on CrowdFlower to examine whether sentence order and sentence proximity impact the readability of news summaries

๏ Results show that sentence order and sentence proximity do have an effect on the readability of news summaries

๏ Dataset (including summaries, preference judgments, and job data) is publicly available (please cite):

๏ http://resources.mpi-inf.mpg.de/d5/txtCoherence/


Thank You! Questions?