Introduction to Machine Learning & Data Analytics
Agenda – 2:00 pm – 2:45 pm
2:00 – 2:05 Introductions and Session Overview
2:05 – 2:10 Machine Learning & Predictive Analytics Background
2:10 – 2:30 Sample ML/PA Study & Findings
2:30 – 2:45 Expert Panel Q & A
Founded in 2004 as a public sector IT consulting firm, Infiniti has evolved into a public sector cloud services and consulting organization with a reputation for delivering results on time and on budget.
Infiniti - Who We Are…
Harnessing a deep commitment to state & local government, education, and healthcare, Infiniti aims to improve the lives of students through innovation and technology.
Infiniti - Where We Work
Cloud, Education, Gov’t Agency, Healthcare, IV&V, MSP
Machine Learning & Predictive Analytics
Deploying analytical IT tools is relatively easy.
Understanding how they might be used is much less clear.
Machine Learning & Predictive Analytics
> Typically start with sensing problems or potential opportunities, which may initially just be somebody’s hunch.
> Often move on to develop theories about the existence of a particular outcome or effect, generate hypotheses, identify relevant data, and conduct experiments.
> They are opportunities for discovery.
Focus more on the “I” and less on the “T” in IT
More like scientific research than traditional IT initiatives. Leads to specific targeted actions.
Our Predictive Analytics Process
The cycle of analyzing, transforming and learning
can be repeated many times
Popular Machine Learning Use Cases
Fraud / Anomaly Detection
Targeted Citizen Outreach
Business / Operational efficiency
Educational Outcome Predictions
Content Personalization
Document Classification
John Gray
Sample ML/PA Study & Findings
Problem Statement & Project Objective
Tasks
Tools and Environments
Deliverables
Roles
Schedule/Duration
Problem Statement & Project Objective
The California public sector client has surveys from millions of people who
apply online. A small percentage give negative feedback. The feedback is
entered as free-form text. The client wants to analyze this text to identify
specific areas of the application process that need to be improved.
The objective of this project is to perform text processing, analysis, and
clustering to understand survey comments from dissatisfied users and
determine the parts of the application process that might need improvement.
(This is a “starter” ML/PA project; the client expects us to help with more complex, higher-benefit projects in the future.)
Tasks - Typical
1. Define the problem. Work with customers to get a good understanding of the
specific questions they want to get answered
2. Analyze existing customer data. If not sufficient, work with customer to collect
additional / relevant data
3. Perform ETL (Extract, Transform & Load) and complete data integrity
checks. Make sure there are no issues with data (missing data, statistical
anomalies, etc.)
4. Make predictions and test outcomes. Model development - feature
engineering and predictive modeling
5. Test predictions for accuracy and validity. Improve / refine until results are
satisfactory
6. Deploy in production
7. Train / transition to customer’s team (or continue to support if required)
8. Discover other potential opportunities. Provide suggestions on other
questions that can be asked
Tasks – This Project
• Perform Sentiment Analysis - Provides insight into the positive or negative emotions communicated in the textual data
• Generate Word Cloud - Visual representation of the key words communicated. Sizes indicate relative importance/frequency.
• Perform Clustering - An unsupervised machine learning technique that reveals underlying themes in the text source
• Temporal Analysis of Negative Sentiment - We look at changes in negative sentiment over time; these might correlate with a client event or with events in the world in general.
Sentiment Analysis – Steps and Results
• Evaluated two models
  o Logistic Regression, Naïve Bayes
• Logistic regression performed better overall (higher accuracy, recall, and F-score; Naïve Bayes had slightly higher precision)
• Classification scores
  o Logistic regression
    • Accuracy – 93.7%
    • Precision – 95.6%, Recall – 97.4%, F-score – 96.5%
  o Naïve Bayes
    • Accuracy – 90.9%
    • Precision – 96.7%, Recall – 93%, F-score – 94.8%
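The reported F-scores can be cross-checked from the reported precision and recall. The sketch below assumes the slide's F-score is the standard F1 score (the harmonic mean of precision and recall); the slide does not say which variant was used, and the function and variable names are illustrative.

```python
def f_score(precision: float, recall: float) -> float:
    """F1 score: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported above, written as fractions.
reported = {
    "Logistic Regression": (0.956, 0.974),
    "Naive Bayes": (0.967, 0.93),
}

f_scores = {name: f_score(p, r) for name, (p, r) in reported.items()}
for name, score in f_scores.items():
    print(f"{name}: F-score = {score:.1%}")
```

Under that assumption, both values round to the figures on the slide (96.5% and 94.8%).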
Sentiment correlates very well with the user provided experience rating
Sentiment Analysis - Results
Words identified align well with sentiment
Generate Word Cloud - Tasks
• Generate key words that dominate negative comments
• Survey comments transformed using the text pre-processing steps below
  o Stemming
  o Removing most common words (I, we, is, etc.)
  o Spellcheck
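The word sizes in a word cloud come from key word frequencies. A minimal sketch of that counting step, using only the Python standard library: the stop-word list and the sample comments are made up for illustration, and stemming and spellcheck are left out here (a package such as NLTK would normally supply a stemmer).

```python
import re
from collections import Counter

# Small illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"i", "we", "is", "the", "a", "to", "and", "it", "was", "of", "me", "my", "too"}


def keyword_counts(comments):
    """Count non-stop-words across all comments (case-insensitive)."""
    counts = Counter()
    for comment in comments:
        words = re.findall(r"[a-z']+", comment.lower())
        counts.update(word for word in words if word not in STOP_WORDS)
    return counts


# Made-up negative comments, not client data.
negative_comments = [
    "The application is too long and confusing",
    "It kept logging me out",
    "Confusing website, I had to start over",
]

counts = keyword_counts(negative_comments)
print(counts.most_common(5))
```

The resulting counts are what a word cloud renderer would map to font sizes.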
Word Cloud - Results
Clustering - Tasks
• Used K-Means Clustering model
• The Process
  o Text pre-processing
  o Convert text to numbers: Term Frequency – Inverse Document Frequency (TF-IDF) transformation
  o Run K-Means for the assigned number of clusters
  o Generate top-n key words that most represent each cluster
  o Analyze output to identify key insights
  o Tune, Iterate: Algorithm parameters, number of clusters, etc.
Clustering - Results
Cluster | Key Words | Theme
0 | college, times, just, apply, confusing, did, student, need, website, difficult | No clear theme
1 | time, kept, consuming, time consuming, waste, logging, waste time, kept logging, times, page | Potential Website Issues
2 | process, application process, long, times, college, student, just, class, students, online | No clear theme
3 | long, takes, way long, took, way, unnecessary, process, complicated, tedious, personal | Time Consuming
4 | school, personal, sexual, high, high school, orientation, sexual orientation, personal information, personal questions, college | Personal Information
Clustering model identified some generic trends on potential sources of dissatisfaction
Temporal Analysis
2014 had a spike in number of negative comments
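A minimal sketch of the yearly count behind this kind of chart. The records below are invented to mirror the shape of the finding and are not the client's data; the field layout (submission date, sentiment label) is also an assumption.

```python
from collections import Counter
from datetime import date

# Hypothetical (submission date, predicted sentiment) records.
records = [
    (date(2013, 3, 14), "negative"),
    (date(2013, 9, 2), "positive"),
    (date(2014, 1, 20), "negative"),
    (date(2014, 4, 8), "negative"),
    (date(2014, 11, 30), "negative"),
    (date(2015, 6, 17), "negative"),
    (date(2015, 8, 5), "positive"),
]

# Count negative comments per calendar year.
negatives_per_year = Counter(
    submitted.year for submitted, sentiment in records if sentiment == "negative"
)

# The year with the most negative comments is a candidate "spike" to investigate.
peak_year, peak_count = max(negatives_per_year.items(), key=lambda item: item[1])

for year in sorted(negatives_per_year):
    print(year, negatives_per_year[year])
print("Peak year:", peak_year)
```

A spike found this way is only a starting point; it still has to be matched against client or external events to be explained.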
Tools/ Environment
Environment can be on-premise, hybrid cloud, or cloud. NetApp storage provides excellent performance for this type of application
Tools and Environment – This Project
Tools/Environment:
• Secure AWS Environment with the following open source tools:
• Python / Natural language toolkit (NLTK) package
• Open source machine learning tools: Python / Scikit-learn package
Roles, Schedule, and Next Steps
Roles on this project:
• Two Data Scientists (Harsha & Ananth)
• Project Manager (part time)
• AWS Solution Architect (to build environment)
Schedule:
• Less than three elapsed months from concept to completion
• Approximately three weeks of actual work
Next Steps:
• This client has at least half a dozen other ML projects
• Fraud, where to apply expert guidance, …
Thank You
Panel Discussion