Effective Information Access Over Public Email Archives
Progress Report
William Lee, Hui Fang, Yifan Li
For CS598CXZ Spring 2005
Introduction and Motivation
Information within a newsgroup or a mailing list has largely been underutilized.
For now, access to that data is restricted to traditional search and browsing.
Mail traffic also grows rapidly. For example, the mailing list for Tomcat (the Java-based web application engine) has more than 37,000 messages from March 2003 to March 2004. That's around 101 messages per day!
Can we access that information more effectively?
Existing Technologies
  Search
  Browse
Project Goals
Thread Detection
  Detect topic shifts within a thread.
  Challenge: We could not find such cases in our collection, so we will not explore this in our project. It is still an interesting research question in the email domain.
Clustering
  Group similar threads together.
  Challenges:
    How to define the similarity function between two threads?
    How to evaluate the clustering results?
Summarizing
  Generate a summary for each cluster.
  Challenges:
    How to identify the important parts of each cluster?
    How to evaluate the summarization results?
Interface to view the clustering results
The Corpus
Newsgroup archives for three Computer Science classes (CS473, CS475, and CS225) at UIUC for Fall 2004.
Each newsgroup contains the messages for a complete semester of the given class.
Unlike previous newsgroup clustering tasks:
  We use a thread, instead of an individual message, as the unit.
  We cluster based on subtopics within a newsgroup.
Progress So Far
Implemented clustering using the CEES (Conversation Extraction and Evaluation Service) architecture
CEES provides an architecture to:
  Gather messages and construct thread trees
  Parse, index, search, and cluster threads
  Integrate with Lucene and Weka
  Cluster threads using different fields
Created the judgment files for evaluating the clustering results manually
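The "construct thread trees" step can be illustrated with a minimal sketch. This is not the CEES code: it assumes each message is a dict with hypothetical keys `id` and `parent`, where `parent` is the id of the message being replied to (in practice this might come from a reply header such as `References` or `In-Reply-To`, which is also an assumption here). The function names are made up for illustration.

```python
from collections import defaultdict


def build_thread_trees(messages):
    """Link messages into thread trees via their parent ids.

    Returns (roots, children): a list of root message ids and a dict
    mapping a message id to the ids of its direct replies.
    """
    known = {m["id"] for m in messages}
    children = defaultdict(list)
    roots = []
    for m in messages:
        parent = m.get("parent")
        if parent in known:
            children[parent].append(m["id"])
        else:
            # No parent, or the parent is missing from the archive:
            # treat the message as the start of a new thread.
            roots.append(m["id"])
    return roots, dict(children)


def thread_members(root, children):
    """Return every message id in the thread rooted at `root`, depth first."""
    out, stack = [], [root]
    while stack:
        node = stack.pop()
        out.append(node)
        # Reverse so that replies are visited in their original order.
        stack.extend(reversed(children.get(node, [])))
    return out
```

Each root then identifies one thread, which is the unit used for clustering in this project.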
Clustering
Use an agglomerative clustering algorithm
Similarity = dot product of Okapi-weighted vectors of corresponding fields
Computes the similarity of:
  Contents
  Subject
  Contents without quote
  First message
  Rest of thread
  Rest of thread without quote
  Participants in a thread (email addresses in the “From:”)
  Linear regression using all the above features
  Logistic regression using all the above features
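The similarity and clustering steps above can be sketched for a single field as follows. This is a minimal illustration, not the CEES implementation: it assumes the BM25 form of Okapi weighting with the common defaults k1 = 1.2 and b = 0.75, a smoothed IDF that stays positive, and average-link merging, none of which the slides specify. The names `okapi_vectors`, `dot`, and `agglomerative` are hypothetical.

```python
import math
from collections import Counter


def okapi_vectors(docs, k1=1.2, b=0.75):
    """Turn tokenized, non-empty documents into sparse BM25-weighted vectors."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    df = Counter(term for d in docs for term in set(d))
    vectors = []
    for d in docs:
        tf = Counter(d)
        # Length normalization (assumed BM25 form).
        norm = k1 * (1 - b + b * len(d) / avg_len)
        vec = {}
        for term, f in tf.items():
            # Smoothed IDF, an assumption chosen to avoid negative weights.
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            vec[term] = idf * f * (k1 + 1) / (f + norm)
        vectors.append(vec)
    return vectors


def dot(u, v):
    """Dot product of two sparse vectors stored as dicts."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v[t] for t, w in u.items() if t in v)


def agglomerative(vectors, k):
    """Merge the most similar pair of clusters (average link) until k remain."""
    clusters = [[i] for i in range(len(vectors))]
    sim = [[dot(u, v) for v in vectors] for u in vectors]

    def link(a, b):
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        pairs = ((x, y) for x in range(len(clusters))
                 for y in range(x + 1, len(clusters)))
        x, y = max(pairs, key=lambda p: link(clusters[p[0]], clusters[p[1]]))
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters
```

For example, `agglomerative(okapi_vectors(threads), k)` would cluster a list of tokenized threads by one field, such as the subject. Combining several fields through linear or logistic regression would replace `dot` with a learned function of the per-field similarities.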
Cluster Quality Measures (He2002)
Overall Entropy = 0.5 * Cluster Entropy + 0.5 * Class Entropy
[Figure: five items grouped as 12 | 34 | 5 (Result) and 123 | 45 (Actual), illustrating Cluster Entropy and Class Entropy]
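A minimal sketch of how these measures could be computed is given below. The slides do not spell out the definitions, so this assumes a common reading: cluster entropy is the size-weighted entropy of the actual class labels inside each result cluster, class entropy is the size-weighted entropy of the result cluster labels inside each actual class, and both use log base 2 without normalization. The function names and the `beta` parameter are hypothetical.

```python
import math
from collections import Counter, defaultdict


def _weighted_entropy(groups, labels):
    """Size-weighted average entropy of `labels` within each group."""
    members = defaultdict(list)
    for g, l in zip(groups, labels):
        members[g].append(l)
    n = len(groups)
    total = 0.0
    for items in members.values():
        counts = Counter(items)
        h = -sum((c / len(items)) * math.log2(c / len(items))
                 for c in counts.values())
        total += len(items) / n * h
    return total


def cluster_entropy(result, actual):
    """How mixed the actual classes are inside each result cluster."""
    return _weighted_entropy(result, actual)


def class_entropy(result, actual):
    """How scattered each actual class is across the result clusters."""
    return _weighted_entropy(actual, result)


def overall_entropy(result, actual, beta=0.5):
    """Weighted combination of the two measures; lower is better."""
    return (beta * cluster_entropy(result, actual)
            + (1 - beta) * class_entropy(result, actual))


# Illustrative five-item example: one label per item.
result = ["r1", "r1", "r2", "r2", "r3"]
actual = ["A", "A", "A", "B", "B"]
print(cluster_entropy(result, actual), class_entropy(result, actual))
```

Under these assumptions, a clustering that puts every item in its own cluster gets a cluster entropy of zero but a high class entropy, which is why the two are combined into one overall score.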
Clustering Performance
[Chart: Cluster Entropy and Class Entropy]
Clustering Performance (2)
Overall Entropy = 0.53 * Cluster Entropy + 0.47 * Class Entropy
Remaining Work
Clustering
  Find a more reasonable cluster quality measure
  Study why the learned similarity function sometimes performs worse than the baseline
  Find a better way to learn the similarity function
Summarization
  Divide it into two subtasks:
    Summarization of announcement-driven discussions
    Summarization of question-driven discussions
Evaluation
  Create judgment files
  Evaluation measures