Email Sherlock: Using Machine Learning to Extract Information from Large Email Datasets

Jay Gondin

Investigations and Emails

● Bear Stearns v. Lehman Brothers

● Enron

● Hillary Clinton

Data: Hillary’s Emails

● 30,320 emails in dataset

● 60,000 meaningful words

● Unique acronyms

○ Ex. Hillary Clinton = Rodham, HRC, Madam Secretary

○ Ex. Obama = President, Administration, Barack

○ Ex. White House = WH
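One way to handle such aliases before vectorizing is to map each one to a single canonical name. The sketch below is only an illustration: the `ALIASES` table is built from the examples above, and the function names are hypothetical, not part of the original project. Mapping a generic word such as "President" to one person is lossy and would need care on real data.

```python
import re

# Hypothetical alias table built from the slide's examples; a real table
# would be curated for the dataset at hand.
ALIASES = {
    "hillary clinton": ["rodham", "hrc", "madam secretary"],
    "obama": ["president", "administration", "barack"],
    "white house": ["wh"],
}

def build_pattern(aliases):
    """Compile one regex that matches any alias as a whole word or phrase."""
    lookup = {alias: canonical
              for canonical, names in aliases.items()
              for alias in names}
    # Longest aliases first so multi-word phrases win over shorter ones.
    ordered = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, ordered)) + r")\b",
        re.IGNORECASE,
    )
    return pattern, lookup

def normalize(text, aliases=ALIASES):
    """Replace every known alias with its canonical (lowercase) name."""
    pattern, lookup = build_pattern(aliases)
    return pattern.sub(lambda m: lookup[m.group(1).lower()], text)

normalized = normalize("HRC will brief the WH before Barack speaks.")
print(normalized)
```

After normalization, every alias of an entity contributes to the same vocabulary term instead of being spread across several.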

Email Pros and Cons

Pros

● Emails may contain crucial information to solve an investigation.

● Unique acronyms may help vectorize emails.

● Emails within a particular dataset have fewer authors.

Cons

● Duplicated text is common.

● A majority of emails do not contain information that is important or relevant to an investigation.

● Unique acronyms may make it more difficult to complete searches.

● Clusters of emails tend to overlap.

Unsupervised Model

● TF-IDF - vectorizer

● LSA - reduce dimension

● DBSCAN - cluster

[Diagram: Raw Data, Machine Learning, SQLite, Analyzed Clusters]

Key Info:

● Orphan emails tend to be less important and/or to have been anonymized.

● Dense clusters may contain more information.

● DBSCAN - density-based spatial clustering of applications with noise
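The vectorize-then-cluster idea can be sketched in plain Python. This is a minimal, dependency-free illustration under stated assumptions, not the project's implementation: the LSA dimension-reduction step is left out, the toy emails and the `eps` and `min_samples` values are made up, and treating the "orphans" above as DBSCAN noise points (label -1) is an interpretation.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one L2-normalized TF-IDF dict per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF so terms present in every document keep a small weight.
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1)
               for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine_distance(a, b):
    """1 - cosine similarity for two unit-length sparse vectors."""
    return 1.0 - sum(w * b.get(t, 0.0) for t, w in a.items())

def dbscan(vectors, eps, min_samples):
    """Label each vector with a cluster id, or -1 for noise."""
    n = len(vectors)
    # Each point counts as its own neighbor (distance 0).
    neighbors = [[j for j in range(n)
                  if cosine_distance(vectors[i], vectors[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
    return labels

# Made-up toy emails, already tokenized.
emails = [
    "benghazi consulate attack security".split(),
    "benghazi consulate security briefing".split(),
    "consulate attack benghazi libya".split(),
    "lunch schedule tomorrow noon".split(),
    "schedule lunch meeting noon".split(),
    "happy birthday".split(),
]
labels = dbscan(tfidf(emails), eps=0.5, min_samples=2)
print(labels)
```

In this toy run the first three emails form one cluster, the next two form another, and the last email shares no vocabulary with the rest, so it is left as noise.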

Semi-Unsupervised Model & Query Expansion

● Search term: Benghazi

● Neural network (word2vec)

● Expanded search terms: Tripoli, Stevens, Libyans, Consulate

● Results (cluster)

● Flask web app & SQLite
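Query expansion with word vectors amounts to a nearest-neighbor lookup: take the vector of the search term and return the words whose vectors are most similar to it. The sketch below assumes cosine similarity and uses made-up 3-dimensional vectors. Real word2vec embeddings would be trained on the email corpus, and `expand_query` is a hypothetical helper name.

```python
import math

# Made-up 3-dimensional vectors, for illustration only; trained word2vec
# embeddings would have far more dimensions.
TOY_VECTORS = {
    "benghazi":  [0.9, 0.8, 0.1],
    "tripoli":   [0.8, 0.9, 0.1],
    "consulate": [0.9, 0.6, 0.2],
    "libyans":   [0.7, 0.9, 0.2],
    "stevens":   [0.8, 0.5, 0.3],
    "lunch":     [0.1, 0.1, 0.9],
    "schedule":  [0.2, 0.1, 0.8],
}

def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def expand_query(term, vectors, top_n=4):
    """Return the term followed by its top_n most similar words."""
    if term not in vectors:
        return [term]  # unknown word: nothing to expand with
    scored = sorted(
        ((cosine(vectors[term], vec), word)
         for word, vec in vectors.items() if word != term),
        reverse=True,
    )
    return [term] + [word for _, word in scored[:top_n]]

expanded = expand_query("benghazi", TOY_VECTORS)
print(expanded)
```

The expanded list can then be matched against the stored clusters, so that a search for one term also surfaces emails that only mention related terms.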

Finding Connections:

[Figure labels: Benghazi, Libyans]

● Clusters are based on meaning.

Sentiment Analysis

● High polarity may indicate sensitive information.
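A rough way to act on this is to score each email's polarity and flag those whose absolute score is high. The sketch below uses a tiny hand-made lexicon and an arbitrary threshold, both assumptions for illustration; the slides do not say which sentiment tool or cutoff was used.

```python
import re

# Tiny hand-made lexicon, for illustration only.
LEXICON = {
    "good": 0.7, "great": 0.8, "thanks": 0.5,
    "attack": -0.8, "killed": -0.9, "terrible": -1.0, "crisis": -0.7,
}

def polarity(text, lexicon=LEXICON):
    """Mean lexicon score of the matched words, in [-1, 1]; 0.0 if none match."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = [lexicon[w] for w in words if w in lexicon]
    return sum(scores) / len(scores) if scores else 0.0

def flag_high_polarity(emails, threshold=0.6):
    """Return (index, polarity) for emails whose |polarity| meets the threshold."""
    scored = ((i, polarity(body)) for i, body in enumerate(emails))
    return [(i, p) for i, p in scored if abs(p) >= threshold]

# Made-up example emails.
emails = [
    "Thanks for the schedule, see you at noon.",
    "Terrible news: the attack killed several people.",
    "Please print the attached memo.",
]
flagged = flag_high_polarity(emails)
print(flagged)
```

Only the second email is flagged here: the first scores a mild 0.5 and the third contains no lexicon words at all.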

Future developments

● Generalize to other datasets

● Adapt the algorithm to prevent fraud

● Develop graphical visualization

● Record user activity to improve the software

Jay Gondin

Master's in Mathematics

Experienced Economic Analyst

[email protected]

github.com/jgondin

linkedin.com/in/gondin
