BuzzTrack Topic Detection and Tracking in Email

BuzzTrack: Topic Detection and Tracking in Email. IUI – Intelligent User Interfaces, January 2007. Gabor Cselle, Google, gabor@google.com. Keno Albrecht, ETH Zurich, kenoa@tik.ee.ethz.ch. Roger Wattenhofer, ETH Zurich, wattenhofer@tik.ee.ethz.ch.

BuzzTrack: Topic Detection and Tracking in Email

IUI – Intelligent User Interfaces, January 2007

Keno Albrecht, ETH Zurich
kenoa@tik.ee.ethz.ch

Roger Wattenhofer, ETH Zurich
wattenhofer@tik.ee.ethz.ch

Gabor Cselle, Google
gabor@google.com

2

Email Overload

• Email clients were not designed to handle the volume and variety of messages users are dealing with today:
  • Large volumes of email
  • Task management
  • Personal archiving or filing
  • Keeping context

[Whittaker and Sidner, 1996]

3

Search vs. Inbox Browsing

• Fast full-text search is today's solution to finding past emails.
• But the flat inbox view of newly incoming emails hasn’t changed.

In our work, we focus on the problem of sensibly structuring emails in the inbox.

4

Today's Email Clients: The Three-Pane View

No sense of context: unrelated messages are shown together

Important emails may drop off the “first screen”

“Thread-based” tree views are unsophisticated, may not pull in all relevant messages.

5

BuzzTrack

Email client extension for Mozilla Thunderbird for displaying email grouped by topic.

6

Related Work

7

Visualizations: Conversations

Gmail (Google)

common conversation title

one entry per email, folds out on click

8

Automatic Foldering

• Using machine learning techniques to automatically move emails into folders upon arrival
• Low accuracy rates [Bekkerman et al., 2005], conceptual problems:
  • Users need to manually create folders and seed them with data.

9

People-Centered Email Clients

Bifrost [Bälter and Sidner, 2002]

ContactMap [Whittaker et al., 2004]

10

Task-based Email

Example: TaskMaster [Bellotti et al., 2003]

[Screenshot of TaskMaster with labeled panes: thrasks, thrask contents, item contents (emails, documents, etc.)]

11

BuzzTrack

12

BuzzTrack

• Mozilla Thunderbird extension to automatically group related emails into topics.

• Will be distributed through website: www.buzztrack.net

• Provides a view on the user’s inbox.

13

What’s a Topic?

• Topics are groups of emails that relate to the same idea, action, event, task, or question.

• Examples:
  • A conversation about buying a digital camera.
  • Referring a candidate for a job.
  • All emails belonging to the same newsgroup.

14

Clustering Process

• For every new incoming email:

[Pipeline diagram with the stages: Preprocessing, Clustering, Label generation, Cluster store, BuzzTrack View in Thunderbird]

15

Preprocessing

• Tokenization (remove HTML tags, style sheets, punctuation, and numbers)
• Language detection
• Stemming
• For topic labelling:
  • Identify parts of speech
  • Remember popular original word forms
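
The tokenization and stemming steps above can be sketched roughly as follows. This is a minimal illustration, not BuzzTrack's actual code: the function names, the regular expressions, and the tiny English suffix list are all assumptions, and language detection and part-of-speech tagging are left out entirely.

```python
import html
import re

# Hypothetical, tiny suffix list; a real system would use a proper stemmer
# (for example Porter's algorithm) chosen per detected language.
_SUFFIXES = ("ing", "ed", "es", "s")


def strip_markup(raw: str) -> str:
    """Remove style sheets and HTML tags, then decode HTML entities."""
    no_style = re.sub(r"(?is)<style.*?>.*?</style>", " ", raw)
    no_tags = re.sub(r"(?s)<[^>]+>", " ", no_style)
    return html.unescape(no_tags)


def tokenize(text: str) -> list[str]:
    """Lowercase and keep alphabetic runs only (drops punctuation and numbers)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def crude_stem(token: str) -> str:
    """Strip one common English suffix; an illustrative stand-in for stemming."""
    for suffix in _SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(raw_email_body: str) -> list[str]:
    """Run markup removal, tokenization, and crude stemming in sequence."""
    return [crude_stem(t) for t in tokenize(strip_markup(raw_email_body))]
```

For topic labelling, a fuller version would also record the most frequent original form of each stem, so that labels can show readable words instead of stems.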

16

Clustering

• Single-link clustering: Newly incoming emails are compared to every email in existing topics:
  • Similarity value > threshold: assigned to topic
  • Similarity value <= threshold: email starts new topic

[Diagram: a new email compared against Topic 1, Topic 2, and Topic 3]
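
A minimal sketch of this incremental single-link step, assuming emails are represented as plain strings and that a pairwise similarity function is supplied by the caller. The name `assign_to_topic` and the choice to join the best-scoring topic when several exceed the threshold are assumptions; the slides only state the threshold rule.

```python
from typing import Callable


def assign_to_topic(
    new_email: str,
    topics: list[list[str]],
    similarity: Callable[[str, str], float],
    threshold: float,
) -> int:
    """Add new_email to a topic and return that topic's index.

    Single-link: a topic's score is the highest similarity between the new
    email and any one email already in the topic.
    """
    best_index, best_score = None, float("-inf")
    for index, topic in enumerate(topics):
        score = max(similarity(new_email, member) for member in topic)
        if score > best_score:
            best_index, best_score = index, score

    if best_index is not None and best_score > threshold:
        topics[best_index].append(new_email)  # similarity > threshold
        return best_index

    topics.append([new_email])  # similarity <= threshold: start a new topic
    return len(topics) - 1


def word_overlap(a: str, b: str) -> float:
    """Toy similarity (Jaccard overlap of word sets), for demonstration only."""
    words_a, words_b = set(a.split()), set(b.split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0
```

Usage: calling `assign_to_topic` once per incoming email, in arrival order, grows the `topics` list in place. Because every stored email is compared, the cost per new email is linear in the number of emails kept.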

17

Features - 1

• How do we generate similarity values between emails?
• Via a linear combination of several similarity features.
• Examples:
  • Text similarity (TF-IDF values, cosine similarity metric)
  • People similarities (comparing sets of people in the From / To / Cc lines of email headers)
  • Thread membership
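
Two of these features can be sketched as below. The slides do not say which TF-IDF weighting or which set comparison BuzzTrack uses, so the smoothed IDF formula and the Jaccard overlap for people are assumptions chosen for illustration; thread membership is omitted.

```python
import math
from collections import Counter


def tfidf_vectors(docs: list[list[str]]) -> list[dict[str, float]]:
    """Build one sparse TF-IDF vector per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append({
            # Smoothed IDF (an assumed variant): log((1 + N) / (1 + df)) + 1
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in term_freq.items()
        })
    return vectors


def cosine_similarity(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def people_similarity(people_a: set[str], people_b: set[str]) -> float:
    """Jaccard overlap of the address sets taken from From / To / Cc headers."""
    union = people_a | people_b
    return len(people_a & people_b) / len(union) if union else 0.0
```

Both functions return values between 0 and 1 for these inputs, which makes them convenient to mix in a weighted sum.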

18

Features - 2

Other features for deriving similarities:
• Subject similarity
• Sender domain overlaps
• Sender rank and percentage
• % of email from sender that is answered
• Time passed since last email in topic
• People and reference count for email
• Known people and reference %
• Cluster size
• Has attachment

19

Decision Score

Similarities are combined into a decision score for each email / cluster pair through a linear combination of feature values:

$$\mathrm{dec}_{i,j} = w_a \cdot \mathrm{sim}_a(m_i, C_j) + w_b \cdot \mathrm{sim}_b(m_i, C_j) + \dots$$

We tested two sets of weights $w_x$, both trained on a development set of emails:

• Empirical
• Linear SVM
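
As a small illustration of the weighted sum, the sketch below computes a decision score from named feature values. The feature names and weight values are hypothetical placeholders, not the weights trained in this work.

```python
def decision_score(features: dict[str, float], weights: dict[str, float]) -> float:
    """Linear combination: sum over features x of w_x * sim_x(m_i, C_j)."""
    return sum(weights[name] * value for name, value in features.items())


# Hypothetical weights; in practice they would be set empirically or taken
# from a linear SVM trained on a development set.
example_weights = {"text": 0.6, "people": 0.3, "thread": 0.1}

# Hypothetical feature values for one email / cluster pair.
example_features = {"text": 0.5, "people": 1.0, "thread": 0.0}

score = decision_score(example_features, example_weights)
```

The score is then compared against a threshold to decide whether the email joins the cluster.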

20

Evaluation

• How do we evaluate clustering quality?
• Topic Detection and Tracking competitions by NIST. Aimed at clustering news articles.

• Corpus:

21

Clustering Tasks

• Clustering task is split into subtasks:
  • New Topic Detection (NTD): Given a stream of emails, which ones start new topics?
  • Topic Tracking (TT): Given a fixed topic, which newly incoming emails belong to it?
• DET curves plot miss rate vs. false alarm rate for each possible threshold on the decision scores
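
The points of a DET curve can be computed with a sketch like the one below, assuming each trial has a decision score and a ground-truth flag, and that a score above the threshold counts as "accept". The function name and input layout are assumptions. Plotting is left out; real DET plots usually draw both axes on a normal-deviate scale.

```python
def det_points(
    scores: list[float],
    is_target: list[bool],
    thresholds: list[float],
) -> list[tuple[float, float, float]]:
    """Return (threshold, miss_rate, false_alarm_rate) for each threshold.

    A miss is a true target whose score is not above the threshold; a false
    alarm is a non-target whose score is above it.
    """
    targets = [s for s, t in zip(scores, is_target) if t]
    non_targets = [s for s, t in zip(scores, is_target) if not t]
    if not targets or not non_targets:
        raise ValueError("need at least one target and one non-target trial")

    points = []
    for threshold in thresholds:
        misses = sum(1 for s in targets if s <= threshold)
        false_alarms = sum(1 for s in non_targets if s > threshold)
        points.append(
            (threshold, misses / len(targets), false_alarms / len(non_targets))
        )
    return points
```

Raising the threshold trades false alarms for misses, so sweeping it over all observed scores traces the full curve.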

22

Results NTD

• TDT New Topic Detection Task

Miss: 3%
False alarm: 30%

[DET curve plot: miss rate vs. false alarm rate]

23

Results TT

• TDT Topic Tracking Task

Miss: 8%
False alarm: 2%

[DET curve plot: miss rate vs. false alarm rate]

24

Comparison

• Comparable quality to TDT for news articles [NIST 2004]
• News has less metadata, email has worse text quality.
• A wide body of work exists on improving clustering performance on news; we haven’t tapped into that yet.

25

BuzzTrack View

• Mozilla Thunderbird plugin that provides useful view on inbox data “for free”

• Topics contain email from last 60 days
  • We’re interested in current email only
  • Reduces initial clustering time

• Each email is shown in one topic

26

27

Demo 1: BuzzTrack

28

BuzzTrack Panes

Topic pane:
• Provides additional info
• Starred topics

Email pane:
• Topics sorted by last incoming email

29

Future Work

• Distribute plugin to Thunderbird users
  • Input on possible UI improvements
  • Input on clustering quality
• Different clustering styles
  • People-based
  • Thread-based
• We hope BuzzTrack will be a valuable tool for real-world users

30

Questions?

Contact: Gabor Cselle, mail@gaborcselle.com

Website: www.buzztrack.net
