Email Conference 2005 Overview

• 26 papers (69 submitted), approx. 150 people attended (in 2004, 29 papers out of 80 submissions, 180 people attended)...AAAI effect

• More Email, less Spam papers. Number of Microsoft papers: 7

• Same size (2 days), Same place. In 2006: same place and size, no one from MS as chair.

Spam Papers

1. Spam Corpus Creation for TREC, Gordon Cormack, Thomas Lynam

   1. Competition starting in 2005

   2. Preliminary results using Enron corpus

2. Comparative Graph Theoretical Characterization of Networks of Spam, Gomes, R. Almeida, Bettencourt, V. Almeida, J. Almeida

3. Spam Deobfuscation using a Hidden Markov Model, Honglak Lee, Andrew Ng

http://www.ceas.cc/papers-2005/162.pdf





Email Papers

1. PEEP- An Information Extraction base approach for Privacy Protection in Email, Boufaden, Elazmeh, Ma, Stan Matwin, El-Kadri, Japkowicz

2. The Social Network and Relationship Finder: Social Sorting for Email Triage, Carman Neustaedter, A.J. Bernheim Brush, Marc A. Smith, Danyel Fisher

3. Email Task Management: An Iterative Relational Learning Approach, Rinat Khoussainov, Nicholas Kushmerick

Comparative Graph Theoretical Characterization of Networks of Spam Gomes, R. Almeida, Bettencourt, V. Almeida, J. Almeida

• 2 graphs: User Graph and Domain Graph

• Large dataset (615K msgs)

• Different Metrics:

– Average Clustering Coefficient

– Prob. of finding a node during a random walk

– Communication Reciprocity (CR)

– etc.

• Practical results?
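
Two of the metrics listed above can be illustrated on a toy user graph. This is only a minimal sketch: the helper names, the sample messages, and the exact definition of communication reciprocity used here (the share of a node's recipients that also wrote back) are assumptions for illustration, not the definitions from the paper.

```python
from collections import defaultdict

def build_user_graph(messages):
    """Build directed and undirected adjacency sets from (sender, recipient) pairs."""
    out_edges = defaultdict(set)
    undirected = defaultdict(set)
    for sender, recipient in messages:
        if sender == recipient:
            continue  # ignore self-addressed mail
        out_edges[sender].add(recipient)
        undirected[sender].add(recipient)
        undirected[recipient].add(sender)
    return dict(out_edges), dict(undirected)

def average_clustering(undirected):
    """Mean local clustering coefficient; nodes with fewer than 2 neighbours count as 0."""
    if not undirected:
        return 0.0
    total = 0.0
    for nbrs in undirected.values():
        k = len(nbrs)
        if k < 2:
            continue
        # Count links that exist between pairs of neighbours.
        links = sum(1 for u in nbrs for v in nbrs if u < v and v in undirected[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(undirected)

def reciprocity(out_edges, node):
    """Assumed CR: fraction of the node's recipients that also sent mail to the node."""
    recipients = out_edges.get(node, set())
    if not recipients:
        return 0.0
    replied = sum(1 for r in recipients if node in out_edges.get(r, set()))
    return replied / len(recipients)

# Hypothetical traffic: a, b, c talk to each other; s only sends and never gets a reply.
messages = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "a"),
            ("b", "c"), ("s", "a"), ("s", "d"), ("s", "e")]
out_edges, undirected = build_user_graph(messages)
avg_cc = average_clustering(undirected)
cr_a = reciprocity(out_edges, "a")
cr_s = reciprocity(out_edges, "s")
```

In this toy data the one-way sender s has reciprocity 0 while a has reciprocity 1, which is the kind of contrast such metrics are meant to expose.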

Spam Deobfuscation using a Hidden Markov Model, Honglak Lee, Andrew Ng

• Spammers obfuscate emails by deliberately using misspellings, typos, etc (Table Below)

• There are anti-spam systems using RegExp to detect obfuscated words. Not robust, low recall.

• Idea: build an HMM robust to some types of obfuscation: misspellings, adding/removing spaces, substitution/insertion of non-alphabetical chars


1. First Model: Lexicon Tree

• 45K words in dictionary

• 111K states

• Emission set has 70 chars: 26 letters + space + other ASCII chars (*, /, -, +, etc.)

• Sf links to So

• Self-transitions: substitutions and insertions

• Epsilon-transitions: deletions

• Parameters to control self and epsilon transitions
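
The roles of substitutions, insertions and deletions can be illustrated with a much simpler stand-in. The sketch below is not the lexicon-tree HMM or its beam-search decoder: it scores each dictionary word separately with a weighted edit distance, where the three costs play the part of the self-transition and epsilon-transition parameters. The function names, cost values and the three-word lexicon are assumptions for illustration.

```python
def obfuscation_cost(observed, word, sub_cost=1.0, ins_cost=0.5, del_cost=1.5):
    """Weighted edit distance from a dictionary word to an observed string.

    sub_cost: a word character is emitted as a different character
    ins_cost: an extra character (e.g. '-', '*', space) is inserted
    del_cost: a word character is skipped (the epsilon-transition analogue)
    All three values are hypothetical, not taken from the paper.
    """
    n, m = len(word), len(observed)
    # dp[i][j]: cheapest way to produce observed[:j] from word[:i]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + del_cost
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 0.0 if word[i - 1] == observed[j - 1] else sub_cost
            dp[i][j] = min(dp[i - 1][j - 1] + match,   # match or substitution
                           dp[i][j - 1] + ins_cost,    # inserted character
                           dp[i - 1][j] + del_cost)    # deleted character
    return dp[n][m]

def deobfuscate(observed, lexicon):
    """Return the lexicon word with the lowest obfuscation cost."""
    observed = observed.lower()
    return min(lexicon, key=lambda w: obfuscation_cost(observed, w))

# Hypothetical three-word lexicon; the model in the talk uses 45K words.
lexicon = ["viagra", "mortgage", "refinance"]
guess_1 = deobfuscate("V1agra", lexicon)
guess_2 = deobfuscate("m-o-r-t-g-a-g-e", lexicon)
```

Scoring every word independently does not scale to a large dictionary, which is one reason to share prefixes in a tree and prune with beam search.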


2. Second Model: Out-of-Dictionary HMM

• Both models use Beam search to decode.


• Results

PEEP- An Information Extraction base approach for Privacy Protection in Email

Boufaden, Elazmeh, Ma, Stan Matwin, El-Kadri, Japkowicz

• idea: monitor outgoing emails for potential privacy breaches in a university

• 4 parts of architecture:

– Preprocessing (segmentation, abbreviations, verb-object, from, to, etc.)

– IE: extracts private info (Grades, names, addresses, IDs, etc)

– Ontology and roles (student, attributes of student, professor, dean, secretary, course, etc)

– Violation Detection (set of privacy rules)



• Domain Knowledge

• Info Access Ontology



• Info Extraction System

1. Shallow Parsing

– CASS partial parser (Abney) and Brill’s POS tagger.

2. Semantic Tagging

– List of words related to 3 classes: Verb-Score (score, receive, rank, etc.), Assignment (mark, test, exam, etc.) and ID (identification number, student ID, etc.). In a small test, this tagger had F1 of 95%.

3. Individual Facts

– “It uses Markov models to learn relevant sequences of semantic tags along with their semantic role. This stage allows the detection of the target relation “the Assignment mark X of student Y””. Extracts the facts X and Y from the semantic tag sequence learned.

– Output: set of relations and facts in Prolog format
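
The word-list semantic tagging step can be sketched as a simple lexicon lookup. This is only an illustration: the tag names, the suffix handling and the sample sentence are assumptions, and the word lists contain only the examples named above, not the full lists used in PEEP.

```python
import re

# Only the example words from the slide; the real lists are longer.
SEMANTIC_LEXICON = {
    "VERB_SCORE": ["score", "receive", "rank"],
    "ASSIGNMENT": ["mark", "test", "exam"],
    "ID": ["identification number", "student id"],
}

def semantic_tags(text):
    """Return (phrase, tag) pairs in order of appearance in the text."""
    lowered = text.lower()
    found = []
    for tag, phrases in SEMANTIC_LEXICON.items():
        for phrase in phrases:
            # Allow a few simple inflections (scores, received, ranked).
            pattern = r"\b" + re.escape(phrase) + r"(?:s|d|ed)?\b"
            for match in re.finditer(pattern, lowered):
                found.append((match.start(), match.group(0), tag))
    return [(phrase, tag) for _, phrase, tag in sorted(found)]

# Hypothetical outgoing email sentence.
tags = semantic_tags("Student ID 1234 received a mark of 85 on the exam.")
```

A later stage would then look for tag sequences such as ID, VERB_SCORE, ASSIGNMENT to extract the relation between a student and a mark.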



• Overall Results:

The Social Network and Relationship Finder: Social Sorting for Email Triage C. Neustaedter, A.J. Bernheim Brush, M. A. Smith, D. Fisher

• SNARF (Social Network and Relationship Finder)

• Social sorting: using social metrics to bring important emails to the top. Metrics: sent and received

The Social Network and Relationship Finder: Social Sorting for Email Triage C. Neustaedter, A.J. Bernheim Brush, M. A. Smith, D. Fisher

• Person-centric visualization

• Social importance can be decided by the many metrics or manually

• Importance can be time dependent
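
Social sorting with sent and received counts can be sketched as follows. This is a minimal illustration, not SNARF's implementation: the message format, the weights and the function names are assumptions.

```python
from collections import Counter

def social_scores(mailbox, me, sent_weight=2.0, received_weight=1.0):
    """Score each correspondent from past traffic (weights are hypothetical)."""
    sent_to = Counter()
    received_from = Counter()
    for msg in mailbox:
        if msg["from"] == me:
            sent_to.update(msg["to"])        # mail I sent to this person
        elif me in msg["to"]:
            received_from[msg["from"]] += 1  # mail this person sent to me
    people = set(sent_to) | set(received_from)
    return {p: sent_weight * sent_to[p] + received_weight * received_from[p]
            for p in people}

def triage(unread, scores):
    """Order unread mail so that socially important senders come first."""
    return sorted(unread, key=lambda m: scores.get(m["from"], 0.0), reverse=True)

# Hypothetical mail history and unread messages.
mailbox = [
    {"from": "me", "to": ["alice"]},
    {"from": "me", "to": ["alice", "bob"]},
    {"from": "alice", "to": ["me"]},
    {"from": "news", "to": ["me"]},
    {"from": "news", "to": ["me"]},
    {"from": "news", "to": ["me"]},
]
unread = [
    {"from": "news", "subject": "Deals"},
    {"from": "stranger", "subject": "Hi"},
    {"from": "alice", "subject": "Draft"},
    {"from": "bob", "subject": "Lunch"},
]
scores = social_scores(mailbox, "me")
ordered_senders = [m["from"] for m in triage(unread, scores)]
```

Weighting sent mail above received mail is one way to keep a high-volume newsletter from outranking a person the user actually writes to. A time-dependent variant could restrict the mailbox to a recent window before counting.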

Email Task Management: An Iterative Relational Learning Approach Rinat Khoussainov, Nicholas Kushmerick

• Combine relations identification with speech acts learning, but for free-text (human-generated) email:

– Relationships between messages in the same task provide additional context to each message that can help to identify speech acts

– Speech acts can help to find relationships between messages and, subsequently, group them into tasks

Slide from CEAS-05, Khoussainov+Kushmerick


• Initial relations:

– text similarity + structured info (subject, send time difference)

• Initial speech acts:

– bag-of-words, SVM

• Using speech acts to clarify relations:

– identify potential parents for each message (as above)

– use a classifier (SVM) trained on similarity and speech acts to prune the links

• Using relations to classify speech acts:

– use speech acts of related (surrounding) messages as extrinsic features in an iterative relational classification algorithm
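
The iterative step, where each message is relabelled using the current speech acts of its related messages, can be sketched with a toy scorer. This is an assumption-heavy illustration: the talk uses SVMs, whereas here the content-only scores and the parent/child compatibility bonuses are hand-set hypothetical numbers, and only the parent link is used as an extrinsic feature.

```python
def iterative_classify(messages, parent, intrinsic, bonus, max_iters=10):
    """Relabel messages until the labels stop changing.

    intrinsic[m][act]: content-only score for a speech act (stands in for the
                       bag-of-words classifier)
    bonus[(parent_act, act)]: hypothetical boost when the parent has parent_act
    """
    # Bootstrap from content-only scores.
    labels = {m: max(intrinsic[m], key=intrinsic[m].get) for m in messages}
    for _ in range(max_iters):
        updated = {}
        for m in messages:
            p = parent.get(m)

            def score(act, m=m, p=p):
                # Extrinsic feature: the parent's label from the previous round.
                relational = bonus.get((labels[p], act), 0.0) if p is not None else 0.0
                return intrinsic[m][act] + relational

            updated[m] = max(intrinsic[m], key=score)
        if updated == labels:
            break  # converged
        labels = updated
    return labels

# Hypothetical three-message thread: m1 <- m2 <- m3.
messages = ["m1", "m2", "m3"]
parent = {"m2": "m1", "m3": "m2"}
intrinsic = {
    "m1": {"request": 0.90, "commit": 0.10, "deliver": 0.00},
    "m2": {"request": 0.45, "commit": 0.40, "deliver": 0.15},
    "m3": {"request": 0.40, "commit": 0.25, "deliver": 0.35},
}
bonus = {("request", "commit"): 0.3, ("commit", "deliver"): 0.3}
final_labels = iterative_classify(messages, parent, intrinsic, bonus)
```

On content alone all three messages look like requests. After the first pass m2 becomes a commit because its parent is a request, and after the second pass m3 becomes a deliver because its parent is now a commit, which shows how label changes propagate along relations over successive iterations.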



• Speech acts

– Set of binary classification problems for each act

– Kappa statistics measure (to account for imbalance in data)

– Initial: at “0”

– During 1st iteration: 1..9

– During 2nd iteration remained the same

• Relations

– Initial: P=R=F1=0.95

– After 1st iteration: P=1.0; R=0.95; F1=0.98

– After 2nd iteration: remained the same
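
Why a kappa statistic helps with imbalanced data can be shown with a small worked example. The sketch below computes Cohen's kappa, which is assumed here to be the kappa measure meant above; the data are made up for illustration.

```python
def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between two labelings, corrected for chance."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    labels = set(y_true) | set(y_pred)
    # Chance agreement from the marginal label frequencies.
    expected = sum((y_true.count(l) / n) * (y_pred.count(l) / n) for l in labels)
    if expected == 1.0:
        return float("nan")  # undefined when both labelings are one constant label
    return (observed - expected) / (1.0 - expected)

# Hypothetical imbalanced act: only 5 of 100 messages are positive.
y_true = [1] * 5 + [0] * 95
always_negative = [0] * 100
accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
kappa_trivial = cohen_kappa(y_true, always_negative)
kappa_perfect = cohen_kappa(y_true, y_true)
```

A classifier that always answers "negative" reaches 95% accuracy on this data but a kappa of 0, so kappa does not reward simply predicting the majority class.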

