Effective Information Access Over Public Email Archives

Progress Report

William Lee, Hui Fang, Yifan Li

For CS598CXZ Spring 2005

Introduction and Motivation

Information within a newsgroup or a mailing list has largely been underutilized.

For now, access to that data is restricted to traditional search and browsing.

Mail traffic also grows rapidly. For example, the Tomcat (the Java-based web application engine) mailing list received more than 37,000 messages from March 2003 to March 2004. That’s around 101 messages per day!

Can we access this information more effectively?

Existing Technologies

  Search
  Browse

Project Goals

Thread Detection
  Detect topic shifts within a thread.
  Challenge: We could not find such cases in our collection, so we will not explore this in our project. It remains an interesting research question in the email domain.

Clustering
  Group similar threads together.
  Challenges:
    How to define the similarity function between two threads?
    How to evaluate the clustering results?

Summarizing
  Generate a summary for each cluster.
  Challenges:
    How to identify the important parts of each cluster?
    How to evaluate the summarization results?

Interface to view the clustering results

The Corpus

Newsgroup archive for 3 Computer Science classes (CS473, CS475, and CS225) at UIUC for Fall 2004.

Each newsgroup contains messages for a complete semester for the given class.

Unlike previous newsgroup clustering tasks:
  We use a thread, instead of an individual message, as the unit.
  We cluster based on subtopics within a newsgroup.

Progress So Far

Implemented clustering using the CEES (Conversation Extraction and Evaluation Service) architecture.

CEES provides an architecture to:
  Gather messages and construct thread trees
  Parse, index, search, and cluster threads
  Integrate with Lucene and Weka
  Cluster threads using different fields

Created the judgment files for evaluating the clustering results manually
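The "gather messages and construct thread trees" step can be illustrated with a minimal sketch. This is not the CEES implementation; it assumes each message is a dict with hypothetical keys `id` (the Message-ID) and `parent` (the In-Reply-To value, or `None`), and it treats any message whose parent is missing from the archive as the start of a new thread.

```python
from collections import defaultdict


def build_thread_trees(messages):
    """Group messages into thread trees using reply links.

    `messages` is a list of dicts with the hypothetical keys
    'id' and 'parent' (None for a message that starts a thread).
    Returns (roots, children), where `children` maps a message id
    to the list of ids that reply to it.
    """
    ids = {m["id"] for m in messages}
    children = defaultdict(list)
    roots = []
    for m in messages:
        parent = m.get("parent")
        if parent in ids:
            children[parent].append(m["id"])
        else:
            # No parent, or the parent is not in the archive: new thread.
            roots.append(m["id"])
    return roots, dict(children)


def thread_members(root, children):
    """Return all message ids in the thread rooted at `root`, depth-first."""
    stack, out = [root], []
    while stack:
        node = stack.pop()
        out.append(node)
        # Reverse so that replies are visited in their original order.
        stack.extend(reversed(children.get(node, [])))
    return out
```

Each root then identifies one thread, and `thread_members` yields the messages that would be concatenated (or split into fields such as first message and rest of thread) before indexing.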

Clustering

Use an agglomerative clustering algorithm.

Similarity = dot product of the Okapi-weighted vectors of corresponding fields.

Similarity is computed over:
  Contents
  Subject
  Contents without quote
  First message
  Rest of thread
  Rest of thread without quote
  Participants in a thread (email addresses in the “From:” header)
  Linear regression using all the above features
  Logistic regression using all the above features
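As a rough illustration of the single-field case, the sketch below weights tokenized threads with one common BM25-style Okapi formula, takes the dot product as similarity, and merges clusters agglomeratively. The slides do not state the exact Okapi variant, the linkage criterion, or the stopping rule, so the parameters `k1` and `b`, the average-link choice, and the target cluster count `k` are all assumptions.

```python
import math
from collections import Counter


def okapi_vectors(docs, k1=1.2, b=0.75):
    """Turn tokenized documents into sparse BM25-style weight vectors."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))
    vectors = []
    for d in docs:
        tf = Counter(d)
        norm = k1 * (1 - b + b * len(d) / avgdl)  # length normalization
        vectors.append({
            t: (f * (k1 + 1) / (f + norm))
               * math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            for t, f in tf.items()
        })
    return vectors


def dot(u, v):
    """Dot product of two sparse vectors stored as dicts."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v[t] for t, w in u.items() if t in v)


def agglomerative(vectors, k):
    """Average-link agglomerative clustering, merging until k clusters remain."""
    clusters = [[i] for i in range(len(vectors))]
    sim = [[dot(u, v) for v in vectors] for u in vectors]

    def link(a, c):
        return sum(sim[i][j] for i in a for j in c) / (len(a) * len(c))

    while len(clusters) > k:
        # Pick the pair of clusters with the highest average similarity.
        x, y = max(
            ((p, q) for p in range(len(clusters))
             for q in range(p + 1, len(clusters))),
            key=lambda pq: link(clusters[pq[0]], clusters[pq[1]]),
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters
```

Running `okapi_vectors` separately on each field (subject, first message, participants, and so on) gives one similarity score per field; those scores are the features a learned combination would take as input.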

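Cluster Quality Measures (He2002)

Overall Entropy = 0.5 * Cluster Entropy + 0.5 * Class Entropy

[Figure: example of a clustering result versus the actual classes, illustrating cluster entropy and class entropy. Item groups shown in the figure: 12, 34, 5 and 123, 45.]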
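A small sketch of how these measures might be computed follows. It assumes the usual reading of the two terms: cluster entropy is the size-weighted entropy of the class labels inside each cluster, and class entropy is the size-weighted entropy of the cluster labels inside each class. The base-2 logarithm, the lack of normalization, and the function names are assumptions and may differ from He2002.

```python
import math
from collections import Counter


def _weighted_entropy(groups, label_of):
    """Size-weighted mean entropy of the label distribution in each group."""
    total = sum(len(g) for g in groups)
    result = 0.0
    for g in groups:
        counts = Counter(label_of[item] for item in g)
        h = -sum((c / len(g)) * math.log2(c / len(g)) for c in counts.values())
        result += (len(g) / total) * h
    return result


def cluster_entropy(clusters, classes):
    """How mixed the true classes are inside each result cluster."""
    class_of = {item: i for i, c in enumerate(classes) for item in c}
    return _weighted_entropy(clusters, class_of)


def class_entropy(clusters, classes):
    """How scattered each true class is across the result clusters."""
    cluster_of = {item: i for i, c in enumerate(clusters) for item in c}
    return _weighted_entropy(classes, cluster_of)


def overall_entropy(clusters, classes, beta=0.5):
    """Weighted combination of the two measures (beta = 0.5 as on the slide)."""
    return (beta * cluster_entropy(clusters, classes)
            + (1 - beta) * class_entropy(clusters, classes))


# Item groups from the slide, read here (as an assumption) as the
# clustering result and the actual classes respectively.
result = [[1, 2], [3, 4], [5]]
actual = [[1, 2, 3], [4, 5]]
print(cluster_entropy(result, actual), class_entropy(result, actual))
```

The two terms pull in opposite directions: putting every thread in its own cluster drives cluster entropy to zero while class entropy rises, and a single giant cluster does the reverse, which is why the combined score is used.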

Clustering Performance

[Chart: Cluster Entropy and Class Entropy]

Clustering Performance (2)

Overall Entropy = 0.53 * Cluster Entropy + 0.47 * Class Entropy

Remaining Work

Clustering
  Find a more reasonable cluster quality measure
  Study why the learned similarity function sometimes performs worse than the baseline
  Find a better way to learn the similarity function

Summarization
  Divide it into two subtasks:
    Summarization of announcement-driven discussions
    Summarization of question-driven discussions

Evaluation
  Create judgment files
  Evaluation measures
