Email Sherlock: Using Machine Learning to Extract Information from Large Email Datasets

Jay Gondin

Investigations and Emails

● Bear Stearns v. Lehman Brothers

● Enron

● Hillary Clinton

Data: Hillary’s Emails

● 30,320 emails in the dataset

● 60,000 meaningful words

● Unique acronyms

○ Ex. Hillary Clinton = Rodham, HRC, Madam Secretary

○ Ex. Obama = President, Administration, Barack

○ Ex. White House = WH

Email Pros and Cons

Pros

● Emails may contain crucial information to solve an investigation.

● Unique acronyms may help vectorize emails.

● Emails within a particular dataset have fewer authors.

● Duplicated text is common.

Cons

● A majority of emails do not contain information that is important or relevant to an investigation.

● Unique acronyms may make it more difficult to complete searches.

● Clusters of emails tend to overlap.

Unsupervised Model

TF-IDF - vectorizer

LSA - reduce dimension

DBSCAN - cluster

[Diagram labels: Raw Data, Machine Learning, SQLite, Analyzed Clusters]

Key Info:

- Orphan emails tend to be less important and/or to have been anonymized.

- Dense clusters may contain more information.

- DBSCAN: density-based spatial clustering of applications with noise.
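
A minimal sketch of the vectorize-then-cluster idea, in plain Python so that it runs without extra libraries. It is not the deck's actual code: the sample emails, the `eps` and `min_samples` values, and all function names are assumptions made for illustration, and the LSA dimension-reduction step is left out because it needs a linear-algebra library.

```python
import math
import re
from collections import Counter

def tfidf_vectors(docs):
    """Return one unit-length TF-IDF dict (term -> weight) per document."""
    tokenised = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Smoothed IDF keeps a non-zero weight for terms found in every document.
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine_distance(a, b):
    """Cosine distance between two unit-length sparse vectors."""
    return 1.0 - sum(w * b.get(t, 0.0) for t, w in a.items())

def dbscan(vectors, eps, min_samples):
    """Minimal DBSCAN; returns one label per point, with -1 meaning noise."""
    n = len(vectors)
    neighbours = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # core point: keep growing the cluster
    return labels

# Made-up sample emails (assumption, not taken from the dataset).
emails = [
    "Benghazi consulate attack update",
    "Update on the Benghazi consulate attack",
    "Lunch schedule for Friday",
    "Friday lunch schedule",
    "Flight itinerary attached",
]
labels = dbscan(tfidf_vectors(emails), eps=0.4, min_samples=2)
print(labels)
```

On this toy input the two Benghazi emails share one cluster, the two lunch emails share another, and the itinerary email is left as noise (label -1), which corresponds to the "orphan" case above. Libraries such as scikit-learn provide `TfidfVectorizer`, `TruncatedSVD`, and `DBSCAN` for doing the same at scale.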

Semi-Unsupervised Model & Query Expansion

Search term (Benghazi) → Neural network (word2vec) → Expanded search terms (Tripoli, Stevens, Libyans, Consulate) → Results (cluster)
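
Query expansion can be sketched as a nearest-neighbour lookup in embedding space. The vectors below are invented three-dimensional stand-ins, not real word2vec output, and `expand_query` is a hypothetical helper name; a trained model would supply learned vectors with many more dimensions.

```python
import math

# Hypothetical embeddings, invented for illustration only.
EMBEDDINGS = {
    "benghazi":  [0.9, 0.1, 0.0],
    "tripoli":   [0.8, 0.2, 0.1],
    "stevens":   [0.7, 0.2, 0.2],
    "libyans":   [0.8, 0.1, 0.2],
    "consulate": [0.7, 0.3, 0.0],
    "lunch":     [0.0, 0.1, 0.9],
    "schedule":  [0.1, 0.0, 0.8],
}

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def expand_query(term, embeddings, top_n=4):
    """Return the top_n vocabulary words most similar to `term`."""
    query = embeddings[term]
    scored = [
        (cosine(query, vec), word)
        for word, vec in embeddings.items()
        if word != term
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [word for _, word in scored[:top_n]]

expanded = expand_query("benghazi", EMBEDDINGS)
print(expanded)
```

With these toy vectors the search term "benghazi" expands to the four related words and leaves out "lunch" and "schedule". The expanded list can then be used to retrieve emails or clusters that never mention the original term.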

Flask Web App & SQLite
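
A rough sketch of the storage side using Python's built-in `sqlite3` module. The table layout, column names, sample rows, and the `search` helper are all assumptions for illustration; the deck does not show its schema, and the Flask layer is omitted here.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emails (id INTEGER PRIMARY KEY, cluster INTEGER, body TEXT)"
)
# Made-up rows: each email is stored with the cluster label assigned earlier.
conn.executemany(
    "INSERT INTO emails (cluster, body) VALUES (?, ?)",
    [
        (0, "Benghazi consulate attack update"),
        (0, "Stevens briefing from Tripoli"),
        (1, "Friday lunch schedule"),
        (-1, "Flight itinerary attached"),
    ],
)

def search(conn, terms):
    """Return (id, cluster, body) rows whose body mentions any of the terms."""
    if not terms:
        return []
    # One parameterised LIKE clause per term; values are never spliced into SQL.
    where = " OR ".join("body LIKE ?" for _ in terms)
    params = [f"%{t}%" for t in terms]
    sql = f"SELECT id, cluster, body FROM emails WHERE {where} ORDER BY id"
    return conn.execute(sql, params).fetchall()

rows = search(conn, ["benghazi", "tripoli", "stevens"])
for row in rows:
    print(row)
```

Searching with the expanded terms returns both cluster-0 emails, including the one that never mentions "Benghazi". A web route could call a function like `search` and render the rows.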

Finding Connections:

Benghazi Libyans

● Clusters are based on meaning.

Sentiment Analysis

● High polarity (strongly positive or strongly negative wording) may indicate sensitive information.
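
The deck does not say how polarity was computed, so the following is only a toy lexicon-based sketch. The word lists, the sample emails, and the 0.5 threshold are assumptions; a real system would more likely rely on a trained sentiment model or an established lexicon.

```python
import re

# Tiny hypothetical lexicons, for illustration only.
POSITIVE = {"great", "thanks", "good", "success"}
NEGATIVE = {"attack", "crisis", "urgent", "failure"}

def polarity(text):
    """Score in [-1, 1]: (positive - negative) / matched words, 0.0 if none match."""
    words = re.findall(r"[a-z]+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

# Made-up sample emails.
emails = [
    "Urgent: crisis after the attack",
    "Thanks, great meeting",
    "Lunch schedule for Friday",
]
# Flag emails whose polarity is strong in either direction.
flagged = [e for e in emails if abs(polarity(e)) >= 0.5]
print(flagged)
```

Here the first two emails are flagged for review and the neutral scheduling email is not.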

Future developments

● Generalize to other datasets

● Adapt the algorithm to prevent fraud

● Develop graphical visualization

● Record user activity to improve the software

Jay Gondin

Master's in Mathematics

Experienced Economic Analyst

gondin@gmail.com

github.com/jgondin

linkedin.com/in/gondin
