Email Conference 2005 Overview
• 26 papers (69 submitted), approx. 150 people attended (in 2004, 29 papers out of 80 submissions, 180 people attended)...AAAI effect
• More Email, less Spam papers. Number of Microsoft papers: 7
• Same size (2 days), Same place. In 2006: same place and size, no one from MS as chair.
Spam Papers
1. Spam Corpus Creation for TREC, Gordon Cormack, Thomas Lynam
– Competition starting in 2005
– Preliminary results using Enron corpus
2. Comparative Graph Theoretical Characterization of Networks of Spam Gomes, R. Almeida, Bettencourt, V. Almeida, J. Almeida
3. Spam Deobfuscation using a Hidden Markov Model, Honglak Lee, Andrew Ng
Email Papers
1. PEEP- An Information Extraction base approach for Privacy Protection in Email, Boufaden, Elazmeh, Ma, Stan Matwin, El-Kadri, Japkowicz
2. The Social Network and Relationship Finder: Social Sorting for Email Triage Carman Neustaedter, A.J. Bernheim Brush, Marc A. Smith, Danyel Fisher
3. Email Task Management: An Iterative Relational Learning Approach Rinat Khoussainov, Nicholas Kushmerick
Comparative Graph Theoretical Characterization of Networks of Spam Gomes, R. Almeida, Bettencourt, V. Almeida, J. Almeida
• 2 graphs: User Graph and Domain Graph
• Large dataset (615K msgs)
• Different Metrics:
– Average Clustering Coefficient
– Prob. of finding a node during a random walk
– Communication Reciprocity (CR)
– etc
• Practical results?
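Two of the metrics listed above can be sketched on a toy graph. The edge lists and function names below are illustrative assumptions, not the paper's data or code. Reciprocity is taken here as the fraction of directed edges whose reverse edge also exists, which may differ from the paper's exact CR definition.

```python
from itertools import combinations

def average_clustering(edges):
    """Average local clustering coefficient of an undirected graph."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    total = 0.0
    for nbrs in neighbors.values():
        k = len(nbrs)
        if k < 2:
            continue  # nodes with fewer than 2 neighbors contribute 0
        # Count pairs of neighbors that are themselves connected.
        links = sum(1 for a, b in combinations(nbrs, 2) if b in neighbors[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(neighbors)

def reciprocity(directed_edges):
    """Fraction of directed edges (u, v) for which (v, u) also exists."""
    edge_set = set(directed_edges)
    mutual = sum(1 for (u, v) in edge_set if (v, u) in edge_set)
    return mutual / len(edge_set)

# Toy user graph: a triangle a-b-c plus a pendant node d (hypothetical data).
undirected = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
directed = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "d")]
avg_cc = average_clustering(undirected)
cr = reciprocity(directed)
```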
Spam Deobfuscation using a Hidden Markov Model, Honglak Lee, Andrew Ng
• Spammers obfuscate emails by deliberately using misspellings, typos, etc (Table Below)
• There are anti-spam systems using RegExp to detect obfuscated words. Not robust, low recall.
• Idea: build an HMM robust to some types of obfuscation: misspellings, adding/removing spaces, substitution/insertion of non-alphabetical chars
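To illustrate why hand-written RegExp rules tend to have low recall, here is a small sketch. The pattern and the sample strings are hypothetical and do not come from any real anti-spam system; the pattern only anticipates one substitution, so every other kind of obfuscation slips through.

```python
import re

# Hypothetical hand-written rule for a single target word.
PATTERN = re.compile(r"v[i1!|]agra", re.IGNORECASE)

samples = ["viagra", "v1agra", "V!AGRA", "v i a g r a", "vaigra", "vi@gra", "viagrra"]

# Strings the rule catches versus strings it misses.
caught = [s for s in samples if PATTERN.search(s)]
missed = [s for s in samples if not PATTERN.search(s)]
```

Each missed variant would need another hand-written alternative in the pattern, which is the brittleness the HMM approach tries to avoid.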
Spam Deobfuscation using a Hidden Markov Model, Honglak Lee, Andrew Ng
1. First Model: Lexicon Tree
• 45K words in dictionary
• 111K states
• Emission set has 70 chars: 26 letters + space + other ASCII chars (*,/,-,+, etc)
• Sf links to So
• Self-transitions: substitutions and insertions
• Epsilon-transitions: deletions
• Parameters to control self and epsilon transitions
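A lexicon tree shares states between words with a common prefix, which is why the state count grows more slowly than the total number of letters in the dictionary. The sketch below builds such a tree for a tiny hypothetical word list; the node layout and function names are assumptions for illustration only.

```python
def build_lexicon_tree(words):
    """Build a prefix tree; each node stands for one state in this sketch."""
    root = {"children": {}, "is_word": False}
    for word in words:
        node = root
        for ch in word:
            # Reuse the existing child if the prefix is already in the tree.
            node = node["children"].setdefault(ch, {"children": {}, "is_word": False})
        node["is_word"] = True
    return root

def count_states(node):
    """Count this node plus all of its descendants."""
    return 1 + sum(count_states(child) for child in node["children"].values())

words = ["car", "cat", "cart", "dog"]  # hypothetical mini-dictionary
tree = build_lexicon_tree(words)
n_states = count_states(tree)
n_letters = sum(len(w) for w in words)
```

Here 13 letters collapse into 9 states (including the root), because "car", "cat" and "cart" share the prefix "ca".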
Spam Deobfuscation using a Hidden Markov Model, Honglak Lee, Andrew Ng
2. Second Model: Out-of-Dictionary HMM
• Both models use Beam search to decode.
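The decoding step can be sketched as a beam search over dictionary prefixes. This is a simplified stand-in, not the authors' model: additive costs replace the HMM's log-probabilities, the cost values and the word list are made up, and at most one deletion is allowed between consecutive observed characters.

```python
def decode(observed, words, beam_width=20, sub_cost=1.0, ins_cost=1.0, del_cost=1.5):
    """Return (word, cost) for the cheapest dictionary word, or None."""
    words = set(words)
    prefixes = {w[:i] for w in words for i in range(len(w) + 1)}
    alphabet = {ch for w in words for ch in w}

    def relax(hyps, prefix, cost):
        # Keep only the cheapest way of reaching each valid prefix.
        if prefix in prefixes and cost < hyps.get(prefix, float("inf")):
            hyps[prefix] = cost

    def with_deletions(hyps):
        # Epsilon-transition: advance one letter without consuming input.
        out = dict(hyps)
        for prefix, cost in hyps.items():
            for ch in alphabet:
                relax(out, prefix + ch, cost + del_cost)
        return out

    def prune(hyps):
        # Beam step: keep the beam_width cheapest hypotheses.
        return dict(sorted(hyps.items(), key=lambda kv: (kv[1], kv[0]))[:beam_width])

    beam = prune(with_deletions({"": 0.0}))
    for obs in observed.lower():
        nxt = {}
        for prefix, cost in beam.items():
            relax(nxt, prefix, cost + ins_cost)  # self-transition: junk char inserted
            for ch in alphabet:
                step = 0.0 if ch == obs else sub_cost  # match or substitution
                relax(nxt, prefix + ch, cost + step)
        beam = prune(with_deletions(nxt))
    finished = {p: c for p, c in beam.items() if p in words}
    if not finished:
        return None
    return min(finished.items(), key=lambda kv: (kv[1], kv[0]))

lexicon = ["viagra", "mortgage", "free"]  # hypothetical dictionary
substituted = decode("v1agra", lexicon)
inserted = decode("vi-agra", lexicon)
deleted = decode("vagra", lexicon)
```

A narrower beam trades accuracy for speed; with a dictionary of 45K words the pruning step is what keeps decoding tractable.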
Spam Deobfuscation using a Hidden Markov Model, Honglak Lee, Andrew Ng
• Results
PEEP- An Information Extraction base approach for Privacy Protection in Email
Boufaden, Elazmeh, Ma, Stan Matwin, El-Kadri, Japkowicz
• idea: monitor outgoing emails for potential privacy breaches in a university
• 4 parts of architecture:
– Preprocessing (segmentation, abbreviations, verb-object, from, to, etc)
– IE: extracts private info (Grades, names, addresses, IDs, etc)
– Ontology and roles (student, attributes of student, professor, dean, secretary, course, etc)
– Violation Detection (set of privacy rules)
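The last stage, violation detection, can be pictured as checking extracted facts against role-based rules. The role table, the single rule, and all names below are hypothetical; the paper's ontology and rule set may look quite different.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    kind: str     # type of private info, e.g. "grade"
    subject: str  # address of the student the info is about

# Hypothetical role assignments (would come from the ontology stage).
ROLES = {
    "alice@uni.example": "student",
    "bob@uni.example": "student",
    "prof@uni.example": "professor",
}

# Hypothetical rule: grades may go to the student concerned or to these roles.
ALLOWED_ROLES = {"grade": {"professor", "dean", "secretary"}}

def violations(facts, recipients):
    """Return (fact, recipient) pairs that break the rule above."""
    found = []
    for fact in facts:
        for rcpt in recipients:
            role = ROLES.get(rcpt, "external")
            allowed = ALLOWED_ROLES.get(fact.kind, set())
            if rcpt != fact.subject and role not in allowed:
                found.append((fact, rcpt))
    return found

grade = Fact("grade", "alice@uni.example")
ok = violations([grade], ["alice@uni.example", "prof@uni.example"])
bad = violations([grade], ["bob@uni.example", "someone@other.example"])
```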
PEEP- An Information Extraction base approach for Privacy Protection in Email
Boufaden, Elazmeh, Ma, Stan Matwin, El-Kadri, Japkowicz
• Domain Knowledge
• Info Access Ontology
PEEP- An Information Extraction base approach for Privacy Protection in Email
Boufaden, Elazmeh, Ma, Stan Matwin, El-Kadri, Japkowicz
• Info Extraction System
1. Shallow Parsing
– CASS partial parser (Abney) and Brill’s POS tagger.
2. Semantic Tagging
– List of words related to 3 classes: Verb-Score (score, receive, rank, etc), Assignment (mark, test, exam, etc) and ID (identification number, student ID, etc). In a small test, this tagger had F1 of 95%.
3. Individual Facts
– “It uses Markov models to learn relevant sequences of semantic tags along with their semantic role. This stage allows the detection of the target relation ‘the Assignment mark X of student Y’”. Extracts the facts X and Y from the semantic tag sequence learned.
– Output: set of relations and facts in Prolog format
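A rough sketch of the tagging and fact-extraction steps follows. It swaps the paper's Markov model for a plain "all required tags present" check, so it only illustrates the data flow. The inflected verb forms, the digit-length heuristic for IDs, and the `mark(...)` predicate name are assumptions, not taken from the paper.

```python
import re

# Word lists follow the classes on the slide; inflected forms are added here.
LEXICON = {
    "VERB_SCORE": {"score", "scored", "receive", "received", "rank", "ranked"},
    "ASSIGNMENT": {"mark", "test", "exam"},
}

def tag(sentence):
    """Assign one semantic tag to each token."""
    tags = []
    for token in re.findall(r"[a-z]+|\d+", sentence.lower()):
        if token in LEXICON["VERB_SCORE"]:
            tags.append((token, "VERB_SCORE"))
        elif token in LEXICON["ASSIGNMENT"]:
            tags.append((token, "ASSIGNMENT"))
        elif token.isdigit() and len(token) >= 5:
            tags.append((token, "ID"))      # assumption: long numbers are IDs
        elif token.isdigit():
            tags.append((token, "NUMBER"))  # assumption: short numbers are marks
        else:
            tags.append((token, "O"))
    return tags

def extract_fact(sentence):
    """Return a Prolog-style fact string, or None if a tag class is missing."""
    by_tag = {}
    for token, label in tag(sentence):
        by_tag.setdefault(label, token)  # keep the first token of each class
    needed = ("VERB_SCORE", "ASSIGNMENT", "ID", "NUMBER")
    if not all(label in by_tag for label in needed):
        return None
    return f"mark({by_tag['ASSIGNMENT']}, {by_tag['ID']}, {by_tag['NUMBER']})."

fact = extract_fact("Student 1234567 received 85 on the exam")
no_fact = extract_fact("See you at lunch tomorrow")
```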
PEEP- An Information Extraction base approach for Privacy Protection in Email
Boufaden, Elazmeh, Ma, Stan Matwin, El-Kadri, Japkowicz
• Overall Results:
The Social Network and Relationship Finder: Social Sorting for Email Triage C. Neustaedter, A.J. Bernheim Brush, M. A. Smith, D. Fisher
• SNARF (social network and relationship finder)
• Social sorting: using social metrics to bring important emails to the top. Metrics: sent and received
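Social sorting with sent and received counts might look like the sketch below. The ordering rule (messages the user sent to a person first, messages received from that person as a tie-breaker) and the sample data are assumptions; SNARF itself offers several metrics.

```python
from collections import Counter

def social_sort(inbox, sent_to):
    """Order inbox messages by how much the user corresponds with each sender."""
    sent_counts = Counter(sent_to)
    received_counts = Counter(msg["from"] for msg in inbox)

    def importance(msg):
        sender = msg["from"]
        return (sent_counts[sender], received_counts[sender])

    # sorted() is stable, so equally important messages keep inbox order.
    return sorted(inbox, key=importance, reverse=True)

# Hypothetical mailbox.
inbox = [
    {"from": "list@news.example", "subject": "Digest"},
    {"from": "boss@corp.example", "subject": "Budget"},
    {"from": "list@news.example", "subject": "Digest 2"},
]
sent_to = ["boss@corp.example", "boss@corp.example", "friend@home.example"]
ranked_subjects = [msg["subject"] for msg in social_sort(inbox, sent_to)]
```

The mailing list sends more mail, but the user has never written back to it, so the message from the frequent correspondent rises to the top.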
The Social Network and Relationship Finder: Social Sorting for Email Triage C. Neustaedter, A.J. Bernheim Brush, M. A. Smith, D. Fisher
• Person-centric visualization
• Social importance can be decided by the many metrics or manually
• Importance can be time dependent
Email Task Management: An Iterative Relational Learning Approach Rinat Khoussainov, Nicholas Kushmerick
• Combine relations identification with speech acts learning, but for free-text (human-generated) email:
– Relationships between messages in the same task provide additional context to each message that can help to identify speech acts
– Speech acts can help to find relationships between messages and, subsequently, group them into tasks
Slide from CEAS-05, Khoussainov+Kushmerick
Email Task Management: An Iterative Relational Learning Approach Rinat Khoussainov, Nicholas Kushmerick
• Initial relations:
– text similarity + structured info (subject, send time difference)
• Initial speech acts:
– bag-of-words, SVM
• Using speech acts to clarify relations:
– identify potential parents for each message (as above)
– use a classifier (SVM) trained on similarity and speech acts to prune the links
• Using relations to classify speech acts:
– use speech acts of related (surrounding) messages as extrinsic features in an iterative relational classification algorithm
Slide from CEAS-05, Khoussainov+Kushmerick
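The iterative relational classification idea can be sketched for one binary speech act as follows. This is not the authors' implementation: a fixed linear mix replaces the SVM, the intrinsic scores are made up, and the single extrinsic feature is the fraction of related messages currently labelled positive.

```python
def iterative_classify(intrinsic, links, weight=0.3, threshold=0.5, max_iters=10):
    """intrinsic: {msg_id: score in [0, 1]}; links: (child, parent) pairs."""
    neighbors = {m: set() for m in intrinsic}
    for child, parent in links:
        neighbors[child].add(parent)
        neighbors[parent].add(child)

    # Initial labels use the content-only (e.g. bag-of-words) scores.
    labels = {m: s >= threshold for m, s in intrinsic.items()}
    for _ in range(max_iters):
        updated = {}
        for m, s in intrinsic.items():
            nbrs = neighbors[m]
            # Extrinsic feature: share of related messages labelled positive.
            frac = sum(labels[n] for n in nbrs) / len(nbrs) if nbrs else 0.0
            updated[m] = (1 - weight) * s + weight * frac >= threshold
        if updated == labels:
            break  # labels are stable, stop iterating
        labels = updated
    return labels

# Hypothetical thread m1 <- m2 <- m3; m2 is borderline on content alone.
scores = {"m1": 0.9, "m2": 0.45, "m3": 0.8}
with_relations = iterative_classify(scores, [("m2", "m1"), ("m3", "m2")])
content_only = iterative_classify(scores, [])
```

In this toy thread the borderline message m2 flips to positive once its neighbours' labels are taken into account, and the labels are stable on the next pass.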
Email Task Management: An Iterative Relational Learning Approach Rinat Khoussainov, Nicholas Kushmerick
• Speech acts
– Set of binary classification problems for each act
– Kappa statistics measure (to account for imbalance in data)
– Initial: at “0”
– During 1st iteration: 1..9
– During 2nd iteration remained the same
• Relations
– Initial: P=R=F1=0.95
– After 1st iteration: P=1.0; R=0.95; F1=0.98
– After 2nd iteration: remained the same
Slide from CEAS-05, Khoussainov+Kushmerick
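For reference, the two measures on this slide can be computed as in the sketch below, using the standard definitions of Cohen's kappa and F1. The confusion-matrix counts are invented to show why kappa is preferred on imbalanced data: a classifier that always answers "no" on a 95/5 split is 95% accurate but has kappa 0.

```python
def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa for a binary confusion matrix."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return (observed - expected) / (1 - expected)

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 5 positives, 95 negatives.
always_negative = cohen_kappa(tp=0, fp=0, fn=5, tn=95)
perfect = cohen_kappa(tp=5, fp=0, fn=0, tn=95)
f1_initial = f1(0.95, 0.95)
f1_after = f1(1.0, 0.95)
```

With the rounded values from the slide, P=1.0 and R=0.95 give an F1 of about 0.974; the reported 0.98 presumably comes from unrounded precision and recall, which is an assumption here.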