Email Sherlock: Using Machine Learning to Extract Information from Large Email Datasets
Jay Gondin
Investigations and Emails
● Bear Stearns V. Lehman Brothers
● Enron
● Hillary Clinton
Data: Hillary’s Emails
● 30,320 emails in dataset
● 60,000 Meaningful Words
● Unique Acronyms
○ Ex. Hillary Clinton = Rodham, HRC, Madam Secretary
○ Ex. Obama = President, Administration, Barack
○ Ex. White House = WH
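One way to deal with such aliases before vectorizing is to map each one to a single canonical name. The following is a minimal sketch, assuming a hand-built alias table; the names `ALIASES` and `normalize` are illustrative and are not taken from the original project.

```python
import re

# Hypothetical alias table built from the examples on this slide.
ALIASES = {
    "Hillary Clinton": ["Rodham", "HRC", "Madam Secretary"],
    "Obama": ["President", "Administration", "Barack"],
    "White House": ["WH"],
}

# (alias, canonical) pairs, longest alias first so that multi-word
# aliases are replaced before any shorter ones.
_PAIRS = sorted(
    ((alias, canonical) for canonical, names in ALIASES.items() for alias in names),
    key=lambda pair: len(pair[0]),
    reverse=True,
)

def normalize(text: str) -> str:
    """Replace every known alias (matched as a whole word) with its canonical name."""
    for alias, canonical in _PAIRS:
        text = re.sub(rf"\b{re.escape(alias)}\b", canonical, text)
    return text

print(normalize("HRC met WH staff"))
```

Context-free replacement like this is crude (for example, "Barack Obama" would become "Obama Obama"), so a real pipeline would likely need extra rules.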
Email Pros and Cons
● Emails may contain crucial information to solve an investigation.
● Unique acronyms may help vectorize emails
● Emails within a particular dataset have relatively few authors
● Duplicated text is often found
● A majority of emails do not contain important and/or relevant information to an investigation
● Unique acronyms may make it more difficult to complete searches
● Clusters of emails tend to overlap
Pros Cons
Unsupervised Model
TF-IDF - vectorizer
LSA - reduce dimension
DBSCAN - cluster
[Diagram labels: Machine Learning, SQLite, Raw Data, Analyzed Clusters]
Key Info:
- Orphan emails tend to be less important and/or were anonymized.
- Dense clusters may contain more information.
- DBSCAN: Density-based spatial clustering of applications with noise
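The TF-IDF and DBSCAN steps can be sketched with the standard library alone. This is only an illustration under stated assumptions: the LSA dimension-reduction step is left out for brevity, cosine distance is an assumed choice of metric, the function names are hypothetical, and the four sample strings are made-up toy inputs rather than emails from the dataset. A real pipeline would more plausibly use a machine learning library.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one L2-normalized {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(docs)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF so terms present in every document keep a positive weight.
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine_distance(a, b):
    """1 - cosine similarity for two L2-normalized sparse vectors."""
    return 1.0 - sum(w * b.get(t, 0.0) for t, w in a.items())

def dbscan(vectors, eps, min_samples):
    """Label each vector with a cluster id, or -1 for noise (orphans)."""
    n = len(vectors)
    # Each point counts as its own neighbor (distance 0).
    neighbors = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
    return labels

# Made-up toy inputs for illustration only.
emails = [
    "benghazi consulate attack report",
    "benghazi consulate attack update",
    "benghazi attack report update",
    "lunch schedule friday",
]
labels = dbscan(tfidf(emails), eps=0.5, min_samples=2)
print(labels)  # the last email shares no terms with the others and is left as noise
```

In this sketch the label -1 plays the role of the "orphan" emails mentioned above, and `eps` and `min_samples` control how dense a group must be to count as a cluster.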
Semi-Unsupervised Model & Query Expansion
Search Term: Benghazi
Neural Network (word2vec)
Expanded Search Term: Tripoli, Stevens, Libyans, Consulate
Results (cluster)
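The expansion step amounts to a nearest-neighbor lookup in the word-vector space. Below is a minimal sketch of that lookup using cosine similarity. The vectors in `TOY_VECTORS` are hand-made for illustration and are an assumption; in the actual approach they would be learned by word2vec from the email corpus. The name `expand` is also hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def expand(term, vectors, top_n=4):
    """Return the top_n words whose vectors are closest to the vector of `term`."""
    query = vectors[term]
    scored = [(cosine(query, vec), word) for word, vec in vectors.items() if word != term]
    return [word for _, word in sorted(scored, reverse=True)[:top_n]]

# Hand-made 3-d vectors for illustration only; learned word2vec vectors
# have many more dimensions.
TOY_VECTORS = {
    "benghazi":  [0.9, 0.1, 0.0],
    "tripoli":   [0.8, 0.2, 0.1],
    "consulate": [0.7, 0.3, 0.0],
    "lunch":     [0.0, 0.1, 0.9],
    "schedule":  [0.1, 0.0, 0.8],
}

print(expand("benghazi", TOY_VECTORS, top_n=2))
```

The expanded terms can then be used together with the original search term when retrieving emails or clusters.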
Flask WebApp & SQLite
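As a rough idea of how expanded search terms could be matched against emails stored in SQLite, here is a small sketch using Python's built-in `sqlite3` module. The `emails` table, its columns, and the sample rows are hypothetical, and the Flask layer is omitted.

```python
import sqlite3

def search(conn, terms):
    """Return (id, cluster) rows for emails whose body contains any of the terms."""
    if not terms:
        return []
    where = " OR ".join("body LIKE ?" for _ in terms)
    params = [f"%{t}%" for t in terms]  # values are bound, never interpolated
    return conn.execute(
        f"SELECT id, cluster FROM emails WHERE {where} ORDER BY id", params
    ).fetchall()

# Hypothetical schema and made-up rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emails (id INTEGER PRIMARY KEY, body TEXT, cluster INTEGER)")
conn.executemany(
    "INSERT INTO emails (id, body, cluster) VALUES (?, ?, ?)",
    [
        (1, "Update on the Benghazi consulate", 0),
        (2, "Call with Tripoli staff", 0),
        (3, "Lunch on Friday", -1),
    ],
)

print(search(conn, ["benghazi", "tripoli"]))
```

SQLite's `LIKE` is case-insensitive for ASCII characters by default, which is why the lowercase terms match here.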
Finding Connections:
Benghazi Libyans
● Clusters are based on meaning.
Sentiment Analysis
● High Polarity may indicate sensitive information.
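A polarity score can be approximated by averaging per-word sentiment values. The sketch below assumes a tiny hand-written lexicon and a threshold of 0.5, both hypothetical; it is not the sentiment tool used in the original project, only an illustration of flagging emails by absolute polarity.

```python
# Tiny hypothetical lexicon; real sentiment tools use far larger, weighted ones.
LEXICON = {"good": 1.0, "great": 1.0, "bad": -1.0, "terrible": -1.0, "attack": -0.5}

def polarity(text):
    """Mean lexicon score of the matched words, in [-1, 1]; 0.0 if none match."""
    scores = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

def flag_high_polarity(emails, threshold=0.5):
    """Return the emails whose absolute polarity reaches the threshold."""
    return [e for e in emails if abs(polarity(e)) >= threshold]

# Made-up toy inputs for illustration only.
print(flag_high_polarity(["terrible attack", "lunch friday"]))
```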
Future developments
● Generalize to other datasets
● Adapt algorithm to prevent fraud
● Develop graphical visualization
● Record user activity to improve the software
Jay Gondin
Master's in Mathematics
Experienced Economic Analyst
github.com/jgondin
linkedin.com/in/gondin