Email Sherlock: Using Machine Learning to Extract Information from Large Email Datasets

Jay Gondin

Investigations and Emails

● Bear Stearns v. Lehman Brothers

● Enron

● Hillary Clinton

Data: Hillary’s Emails

● 30,320 emails in dataset

● 60,000 meaningful words

● Unique acronyms

○ Ex. Hillary Clinton = Rodham, HRC, Madam Secretary

○ Ex. Obama = President, Administration, Barack

○ Ex. White House = WH
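
One way to handle aliases like these before vectorizing or searching is to map each one to a single canonical name. The sketch below is illustrative only: the `ALIASES` table copies the examples from this slide, and the `normalize` helper is a hypothetical name, not something from the original project.

```python
import re

# Alias table taken from the slide's examples; a real project would need a
# much larger, curated list (this one is an assumption for illustration).
ALIASES = {
    "Hillary Clinton": ["Rodham", "HRC", "Madam Secretary"],
    "Obama": ["President", "Administration", "Barack"],
    "White House": ["WH"],
}

# Reverse lookup: alias -> canonical name.
_LOOKUP = {alias: canon for canon, aliases in ALIASES.items() for alias in aliases}

# Match longer aliases first so multi-word aliases win over shorter ones.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(a) for a in sorted(_LOOKUP, key=len, reverse=True))
    + r")\b"
)


def normalize(text):
    """Replace every known alias in text with its canonical name."""
    return _PATTERN.sub(lambda m: _LOOKUP[m.group(1)], text)


normalized = normalize("HRC will brief the WH on Monday")
print(normalized)  # Hillary Clinton will brief the White House on Monday
```

Matching is case-sensitive here so that short acronyms such as WH are not confused with ordinary lowercase text.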

Email Pros and Cons

Pros

● Emails may contain crucial information to solve an investigation.

● Unique acronyms may help vectorize emails.

● Emails within a particular dataset have fewer authors.

● Emails often contain duplicated text.

Cons

● A majority of emails do not contain information that is important or relevant to an investigation.

● Unique acronyms may make it more difficult to complete searches.

● Clusters of emails tend to overlap.

Unsupervised Model

TF-IDF - vectorizer

LSA - reduce dimension

DBSCAN - cluster

[Diagram: Raw Data, SQLite, Machine Learning, Analyzed Clusters]

Key Info:

- Orphan emails (those outside any cluster) tend to be less important and/or were anonymized.

- Dense clusters may contain more information.

- DBSCAN: density-based spatial clustering of applications with noise.
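
To make the orphan and dense-cluster idea concrete, here is a minimal, dependency-free sketch of TF-IDF vectors clustered with DBSCAN on cosine distance. It is not the author's implementation: the toy emails, the `eps` and `min_samples` values, and all function names are assumptions, and the LSA dimension-reduction step is left out to keep the example short.

```python
import math
import re
from collections import Counter


def tfidf_vectors(docs):
    """Return one unit-length sparse TF-IDF vector (term -> weight) per doc."""
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    n = len(docs)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed inverse document frequency, one common variant.
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine_distance(a, b):
    # Vectors are unit length, so the dot product is the cosine similarity.
    return 1.0 - sum(w * b.get(t, 0.0) for t, w in a.items())


def dbscan(vectors, eps, min_samples):
    """Return a cluster id per vector; -1 marks noise (the "orphans")."""
    n = len(vectors)
    neighbors = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # noise for now; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # core point: keep expanding
    return labels


# Made-up emails for illustration only.
emails = [
    "Meeting about the Libya consulate security",
    "Security meeting about the consulate in Libya",
    "Consulate security in Libya, follow up meeting",
    "Lunch schedule for Friday",
    "Friday lunch schedule",
    "Please print",
]
labels = dbscan(tfidf_vectors(emails), eps=0.6, min_samples=2)
print(labels)  # [0, 0, 0, 1, 1, -1]
```

The last email shares no terms with any other, so it ends up as an orphan with label -1. In practice a library such as scikit-learn (`TfidfVectorizer`, `TruncatedSVD` for LSA, `DBSCAN`) would normally replace the hand-written steps.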

Semi-Unsupervised Model & Query Expansion

Search Term: Benghazi

Neural Network (word2vec)

Expanded Search Terms: Tripoli, Stevens, Libyans, Consulate

Results (cluster)

Flask WebApp & SQLite
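
The query-expansion flow on this slide can be sketched as follows. This is an illustration under stated assumptions: the three-dimensional vectors in `EMBEDDINGS` are invented stand-ins for what a trained word2vec model would provide, and the `emails` table schema, the sample rows, and the `expand_query` and `search` helpers are hypothetical.

```python
import math
import sqlite3

# Invented toy vectors; a trained word2vec model would supply the real ones.
EMBEDDINGS = {
    "benghazi": [0.9, 0.8, 0.1],
    "tripoli": [0.8, 0.9, 0.1],
    "stevens": [0.8, 0.7, 0.2],
    "libyans": [0.9, 0.6, 0.2],
    "consulate": [0.7, 0.6, 0.3],
    "lunch": [0.1, 0.0, 0.9],
    "schedule": [0.0, 0.2, 0.9],
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def expand_query(term, top_n=4):
    """Return the term followed by its top_n nearest neighbours by cosine similarity."""
    target = EMBEDDINGS[term]
    scored = sorted(
        ((cosine(target, vec), word) for word, vec in EMBEDDINGS.items() if word != term),
        reverse=True,
    )
    return [term] + [word for _, word in scored[:top_n]]


def search(conn, terms):
    """Return ids of emails whose body contains any of the terms."""
    where = " OR ".join("lower(body) LIKE ?" for _ in terms)
    rows = conn.execute(
        f"SELECT id FROM emails WHERE {where} ORDER BY id",
        [f"%{t}%" for t in terms],
    )
    return [row[0] for row in rows]


# In-memory database with made-up rows for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emails (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO emails (body) VALUES (?)",
    [
        ("Update on Benghazi",),
        ("Call with Tripoli staff",),
        ("Lunch schedule for Friday",),
        ("Consulate security review",),
    ],
)

terms = expand_query("benghazi")
hits = search(conn, terms)
print(terms)
print(hits)  # [1, 2, 4]
```

Searching for the single word "benghazi" would match only the first email; the expanded term list also reaches the Tripoli and consulate emails while still skipping the unrelated one.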

Finding Connections:

Benghazi Libyans

● Clusters are based on meaning.

Sentiment Analysis

● High polarity (strongly positive or negative sentiment) may indicate sensitive information.
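
The slides do not say which sentiment tool was used, so the following is only a rough sketch of the idea: score each email with a tiny word-list polarity in the range -1 to 1 and flag those far from zero. The `POSITIVE` and `NEGATIVE` word lists, the 0.5 threshold, and the sample emails are all assumptions made for illustration.

```python
import re

# Hypothetical mini-lexicons; a real sentiment library would use far larger ones.
POSITIVE = {"great", "good", "thanks", "pleased", "success"}
NEGATIVE = {"attack", "crisis", "urgent", "failure", "angry"}


def polarity(text):
    """Return a score in [-1, 1]: (positive - negative) / (positive + negative)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


# Made-up emails for illustration only.
emails = [
    "Thanks, the meeting was good",
    "Urgent: crisis after the attack",
    "Schedule attached",
    "Good news but urgent follow up needed",
]

# Flag emails whose polarity is strong in either direction.
flagged = [e for e in emails if abs(polarity(e)) >= 0.5]
print(flagged)
```

Only the first two emails are flagged: the third has no lexicon words and the fourth has one positive and one negative word, so both score 0.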

Future developments

● Generalize to other datasets

● Adapt the algorithm to prevent fraud

● Develop graphical visualization

● Record user activity to improve the software

Jay Gondin

Master's in Mathematics

Experienced Economic Analyst

[email protected]

github.com/jgondin

linkedin.com/in/gondin