Conversation Disentanglement in Sports Discourse
Anthony Wong
6/01/11
Importance of Topic
What is conversation disentanglement?
◦ A clustering task: dividing a transcript into a number of smaller, separate conversations
Conversation disentanglement has a couple of practical applications:
◦ Summary generation
◦ User-interface systems such as automatic threading
Basis of my Approach
Michael Elsner and Eugene Charniak (2008)
◦ Uses lexical and non-lexical features to cluster different threads
◦ Features: time between utterances, same speaker, number of shared words, “content” words
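As a rough illustration of the feature types listed above, the sketch below computes them for a pair of utterances. It is not the Elsner/Charniak implementation: the `Utterance` class, the `pair_features` function, and the tiny `STOPWORDS` list are hypothetical, and "content" words are approximated here simply as shared words that are not stopwords.

```python
import re
from dataclasses import dataclass


@dataclass
class Utterance:
    time: int      # timestamp; the unit (e.g. seconds) is an assumption
    speaker: str
    text: str


# Hypothetical, deliberately tiny stopword list used to approximate
# "content" words; a real system would use a fuller list or word statistics.
STOPWORDS = {"the", "a", "an", "is", "it", "to", "and", "of", "in", "that", "i"}


def tokens(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def pair_features(a, b):
    """Return the four feature types named on the slide for one utterance pair."""
    shared = set(tokens(a.text)) & set(tokens(b.text))
    return {
        "time_gap": abs(b.time - a.time),
        "same_speaker": a.speaker == b.speaker,
        "shared_words": len(shared),
        "shared_content_words": len(shared - STOPWORDS),
    }


first = Utterance(10, "A", "The Raptors are losing the game")
second = Utterance(25, "B", "the raptors always lose")
features = pair_features(first, second)
print(features)
```

A pairwise classifier would take a dictionary like this as input and decide whether the two utterances belong to the same thread.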
Proposed Project Overview
Follow the methodology in Elsner and Charniak’s paper
◦ Create and annotate a dataset of sports discourse
Use the existing Elsner/Charniak model to provide baseline classification results and see how well their model adapts to a different chat domain
Test out different feature combinations to hopefully raise performance
Compare results with the Elsner/Charniak paper in some meaningful way
Progress so far
Retrieving and preparing data
Annotating the data
T1 715 KateC : Sam - this is going to be painful, isn't it?
T1 715 SamHolako : I hope not Kate, but Howard, Nelson and Carter have killed the Raptors in the past
T2 715 JaredWade : Classic Frisco. The Minnesota bathroom smells worse, I hear.
T3 715 Anthony(RapsFan) : @Batman: His WP48 is the worst on the team. Andrea is terrible. He scores. That's about it.
T3 715 Arnold : Holy impossibilities , Batman - that won't happen.
T4 715 BretLaGree : Raja Bell and Mike Bibby just held a flop-off in the lane. Bell won.
T5 715 Bobbo : Zach, Go hit up Cinnabun!!! worth the $$...write it off to ESPN anyway
T5 715 ZachHarper : I don't think it works that way
T6 715 Aras : Jared!
T6 715 JaredWade : Aras.
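Each annotated line in the excerpt above appears to follow the pattern "thread label, number, speaker, colon, text". A minimal parsing sketch under that assumption follows. The `ChatLine` type and `parse_line` function are hypothetical names, and reading the second field as a timestamp is an inference from the excerpt, not something the slides confirm.

```python
import re
from typing import NamedTuple, Optional


class ChatLine(NamedTuple):
    thread: str    # annotation label such as "T1"
    time: int      # second field, assumed to be a timestamp
    speaker: str
    text: str


# Thread label, numeric field, speaker name (no spaces), " : ", then the text.
LINE_RE = re.compile(r"^(T\d+)\s+(\d+)\s+(\S+)\s+:\s*(.*)$")


def parse_line(line: str) -> Optional[ChatLine]:
    """Parse one annotated line; return None if it does not match the pattern."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    thread, time, speaker, text = match.groups()
    return ChatLine(thread, int(time), speaker, text)


parsed = parse_line("T3 715 Anthony(RapsFan) : @Batman: His WP48 is the worst on the team.")
print(parsed)
```

Only the first " : " after the speaker name is treated as the separator, so colons inside the message text (as in "@Batman:") are preserved.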
The annotated part of this transcript has 399 lines.
177 unique threads.
The average conversation length is 2.25423728814.
The median conversation length is 2.
The entropy is 7.0155726118 bits.
The median chat has 0.0 interruptions per line.
The average block of 10 contains 6.25706940874 threads.
The line-averaged conversation density is 2.77944862155.
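The simpler statistics above can be recomputed from the thread label of each line. The reported average length is consistent with 399 lines / 177 threads. The sketch below assumes that "entropy" means the Shannon entropy of the distribution of lines over threads. `thread_stats` is a hypothetical helper, not part of the Elsner code, and the labels are the ten annotated lines shown earlier.

```python
from collections import Counter
from math import log2
from statistics import median


def thread_stats(labels):
    """Summarize an annotation given one thread label per transcript line."""
    sizes = Counter(labels)              # lines per thread
    n_lines = len(labels)
    n_threads = len(sizes)
    # Probability that a randomly chosen line belongs to each thread.
    probs = [count / n_lines for count in sizes.values()]
    return {
        "lines": n_lines,
        "threads": n_threads,
        "mean_length": n_lines / n_threads,
        "median_length": median(sizes.values()),
        # Assumed definition: Shannon entropy over thread sizes, in bits.
        "entropy_bits": -sum(p * log2(p) for p in probs),
    }


# Thread labels of the ten annotated lines in the excerpt.
labels = ["T1", "T1", "T2", "T3", "T3", "T4", "T5", "T5", "T6", "T6"]
stats = thread_stats(labels)
print(stats)
```

Interruptions per line, threads per block of 10, and conversation density depend on line order and are not covered by this sketch.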
Running Elsner model as is
T1 715 KateC : Sam - this is going to be painful, isn't it?
T2 715 SamHolako : I hope not Kate, but Howard, Nelson and Carter have killed the Raptors in the past
T3 715 JaredWade : Classic Frisco. The Minnesota bathroom smells worse, I hear.
T4 715 Anthony(RapsFan) : @Batman: His WP48 is the worst on the team. Andrea is terrible. He scores. That's about it.
T5 715 Arnold : Holy impossibilities , Batman - that won't happen.
T6 715 BretLaGree : Raja Bell and Mike Bibby just held a flop-off in the lane. Bell won.
T7 715 Bobbo : Zach, Go hit up Cinnabun!!! worth the $$...write it off to ESPN anyway
T8 715 ZachHarper : I don't think it works that way
T9 715 Aras : Jared!
T9 715 JaredWade : Aras.
368 unique threads.
The average conversation length is 1.08423913043.
The median conversation length is 1.
The entropy is 8.48485646504 bits.
The median chat has 0.0 interruptions per line.
The average block of 10 contains 9.52699228792 threads.
The line-averaged conversation density is 1.42355889724.
Editing the model and evaluation
Still in progress
◦ A lot of room for improvement
◦ Many different feature combinations to try
Need to get evaluation code running
Issues
Documentation for Elsner code is good, but my Python is not
Integration issues between my data and Elsner code
MEGA Model Optimization Package (megam)