Company Confidential - For Internal Use Only
Copyright © 2017, SAS Institute Inc. All rights reserved.
ANALYTICS TOOLS AND METHODS:
PRACTITIONER PERSPECTIVES
Guest Lecturer: Scott Allen Mongeau
Data Scientist
Cyber Analytics
Cell: + 31 (0)6 8370 3097
BIG DATA AND BUSINESS ANALYTICS
Masters of Business and Information Management
2016 - 2017
dr Jan van Dalen
Education
• PhD (ABD)
• MBA
• MA Financial Mgmt
• Cert. Finance
• GD IT Mgmt
• MA Com Tech
Experience
• SAS Institute: Sr. Mgr. Business Solutions
• Deloitte: Manager Analytics
• Nyenrode University: Lecturer Analytics
• SARK7: Owner / Principal Consultant
• Genentech Inc. / Roche: Principal Analyst / Sr. Mgr.
• Atradius: Sr. R&D Engineer
• CFSI: CIO
Data Scientist
Cyber Analytics
+31 (0)64 235 3427
Scott Allen Mongeau, Certified Analytics Professional (CAP)
YouTube
• Introduction to Advanced Analytics
• Introduction to Cognitive Analytics
• TedX RSM: Data Analytics
Blog: sctr7.com
Twitter: sark7
Web: sark7.com
IT solutions
Research methods
Finance
Data analytics
Consulting
DATA ANALYTICS MARKET LEADER
• 40 years of business analytics
• #1: world’s largest privately held software company
• 14,000 SAS employees worldwide
• 93 of the top 100 companies on the Global 500 list
• 80,000+ customer sites in 148 countries
• US $3.2 B: continuous revenue growth since 1976
• 23% annual reinvestment in R&D
ANALYTICS SOLUTIONS
• FORECASTING: leveraging historical data to drive better insight into proactive decision-making
• DATA MINING / MACHINE LEARNING: mine transaction databases to create models of likely outcomes
• TEXT ANALYTICS: finding treasures in unstructured data, like social media or survey tools, that could uncover insights about key business challenges
• OPTIMIZATION: analyze massive amounts of data in order to accurately identify areas likely to produce the most profitable results
• STATISTICS
• Data Management (Integration, Quality & Governance)
MOORE’S LAW: EXPONENTIAL GROWTH OF COMPUTING POWER
Figure labels: 25,000 x; home computers; high-capacity servers; smartphone explosion; cloud, AI / Watson, IoT; 2015
PEOPLE & ORGANIZATION:
DATA SCIENTIST ROLE
Calvin.Andrus (2012) http://en.wikipedia.org/wiki/File:DataScienceDisciplines.png
SEEKING THE
‘DATA SCIENTIST’
DATA SCIENCE
PROFESSIONAL
PERSPECTIVES
http://www.oreilly.com/data/free/2016-data-science-salary-survey.csp
DATA ANALYTICS
• Data science
• Statistician
• Data miner / machine learning
• Text analytics / mining
BUSINESS ANALYTICS
• Business analyst
• BI solutions
• Visualization / interface design
• Functional domain specialty (e.g. marketing analytics)
DATA MANAGEMENT
• Information / data architecture
• Database management
• Data engineering
• Data quality / governance / MDM
OPERATIONS
• Analytics engineering / operations
• Security
• IT systems management
BUSINESS / ORGANIZATIONAL
• Decision Management
• Change management
• Analytics project management
• Domain expert / functional specialty / business manager
DATA SCIENCE
PEOPLE/ROLES
CORE DATA SCIENCE SKILLSET
IT
• BI/reports/dashboards
• Programming
• Systems/software dev
• Algorithms
• Systems administration
• User interface design/visualization
Mathematics
• Econometrics
• Graph analysis
• Matrix mathematics
• Multivariate analysis
• Probability
• Survival analysis
• Statistics
• Spatial analysis
• Temporal analysis
Business Domain
• Finance
• Operations
• Sales/marketing
• HR
Data Engineering
• Big & fast data solutions
• Data manipulation/ETL
• Database design
• Data structures
• Graphical
• NOSQL
• Unstructured data
Data Science
• Machine Learning
• Optimization
• Predictive analytics
• Simulation
• Text/semantic analytics
Research
• Scientific method
• Experimental design
• Research methodologies
• Social science methods
• Survey research
TECHNOLOGY & TOOLS:
DATA ANALYTICS TOOLS & TECH
APPLIED TECHNIQUES & TECHNOLOGIES
• Algorithms
(ex: computational complexity, CS theory)
• Back-end programming
(ex: JAVA/Rails/Objective C)
• Bayesian/Monte-Carlo statistics
(ex: MCMC, BUGS)
• Big and distributed data
(ex: Hadoop, Map/Reduce)
• Business
(ex: management, business development, budgeting)
• Classical statistics
(ex: general linear model, ANOVA)
• Data manipulation
(ex: regexes, R, SAS, web scraping)
• Front-end programming
(ex: JavaScript, HTML, CSS)
• Graphical models
(ex: social networks, Bayes networks)
• Machine learning
(ex: decision trees, neural nets, SVM, clustering)
• Math
(ex: linear algebra, real analysis, calculus)
• Optimization
(ex: linear, integer, convex, global)
• Science
(ex: experimental design, technical writing/publishing)
• Simulation
(ex: discrete, agent-based, continuous)
• Solutions development
(ex: design, project management)
• Spatial statistics
(ex: geographic covariates, GIS)
• Structured data
(ex: SQL, JSON, XML)
• Surveys and marketing
(ex: multinomial modeling)
• Systems administration
(ex: *nix, DBA, cloud tech.)
• Temporal statistics
(ex: forecasting, time-series analysis)
• Unstructured data
(ex: noSQL, text mining)
• Visualization
(ex: statistical graphics, mapping, web-based dataviz)
SOURCE: “Analyzing the Analyzers”
http://www.datasciencecentral.com/profiles/blogs/how-to-become-a-data-scientist?overrideMobileRedirect=1
PROCESS & METHODS:
DATA ANALYTICS
Figure: value versus sophistication
• DESCRIPTIVE: What happened?
• PREDICTIVE: What are trends?
• PRESCRIPTIVE: What to do?
Figure: value versus sophistication (DESCRIPTIVE, PREDICTIVE, PRESCRIPTIVE)
• Business Intelligence (BI)
• Econometrics
• Forecasting
• Machine Learning
• Operations Management
Figure: analytics maturity versus business value
Axes: analytics maturity (aspirational to transformed); business value (transactional to strategic)
• DATA QUALITY
• Business Intelligence: data visualization
• Advanced Analytics
– DESCRIPTIVE: understanding patterns
– DIAGNOSTICS: identifying factors & causes
– PREDICTIVE: forecasting & probabilities
– PRESCRIPTIVE: optimizing systems
– SEMANTIC: understanding social context & meaning
CRISP-DM
Provost; Fawcett. Data Science for Business
Chapter 2: Business Problems and Data Science Solutions
SAS ANALYTICS LIFECYCLE
Phases: FRAMING & DISCOVERY; EXPLANATION & PREDICTION
• PROBLEM FRAMING
• DATA SELECTION & GATHERING
• DATA EXPLORATION
• TRANSFORM & SELECT
• MODEL BUILDING
• MODEL VALIDATION
• MODEL DEPLOYMENT
• EVALUATE & MONITOR RESULTS
Fair use: illustrate publication and article of issue in question. The Economist.
http://en.wikipedia.org/wiki/Category:Fair_use_The_Economist_magazine_covers
Public domain Agricultural Research Service
http://en.wikipedia.org/wiki/File:Orange_juice_1.jpg
GNU Free Documentation License: Ibanix Suzuki Shahid DL650 motorcycle
http://commons.wikimedia.org/wiki/File:Suzuki_vstrom_dl650_motorcycle.jpg
PREDICTIVE ANALYTICS:
SUPERVISED MACHINE LEARNING
Supervised learning - predictive
• k-nearest neighbors (k-NN)
• Decision Trees (DT)
(random forests, boosted trees)
• Naïve Bayes classifier
• Neural networks
• Support Vector Machine (SVM)
• Ensembles / Ensemble Learning
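As a minimal sketch of what supervised learning means in practice, the following trains a one-level decision tree (a decision stump) on a labeled training set and scores it on a held-out validation set. The transaction amounts, labels, and function names are hypothetical illustrations, not taken from the lecture.

```python
from typing import List, Tuple

# Hypothetical labeled data: (transaction amount, label), 1 = unusual, 0 = normal.
TRAIN: List[Tuple[float, int]] = [
    (12.0, 0), (25.0, 0), (40.0, 0), (55.0, 0),
    (300.0, 1), (450.0, 1), (700.0, 1), (90.0, 0),
]
VALID: List[Tuple[float, int]] = [(30.0, 0), (500.0, 1), (80.0, 0), (350.0, 1)]


def accuracy(data, threshold):
    """Share of rows where the rule 'amount > threshold' matches the label."""
    hits = sum(1 for amount, label in data if int(amount > threshold) == label)
    return hits / len(data)


def fit_stump(data):
    """Pick the split threshold with the best accuracy on the training data."""
    amounts = sorted({amount for amount, _ in data})
    # Candidate thresholds: midpoints between consecutive distinct amounts.
    candidates = [(a + b) / 2 for a, b in zip(amounts, amounts[1:])]
    return max(candidates, key=lambda t: accuracy(data, t))


threshold = fit_stump(TRAIN)
print("learned threshold:", threshold)
print("validation accuracy:", accuracy(VALID, threshold))
```

Random forests and boosted trees build on the same idea by combining many such splits; the stump is used here only because it fits in a few lines.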
Decision Tree
Machine Learning
Support Vector Machines
MACHINE LEARNING PREDICTION (SUPERVISED)
Figure labels: CAR Engine; training set, validation set; non-criminal (NORMAL), criminal (UNUSUAL)
• Device
• Time of day
• Source location
• IP
• Threat intelligence
• Amount
• At risk profile
• Destination location
• Secure profile
• Known devices
• Average amount
• Known location
• Known destination
EXAMPLE MACHINE LEARNING TOOLS
Open source
• R
• Python
• Weka
Commercial
• SAS BASE & JMP
• SAS Enterprise Miner
• IBM SPSS
• Oracle Data Mining
• RapidMiner
Ranjit Bose, (2009),"Advanced analytics: opportunities and challenges",
Industrial Management & Data Systems, Vol. 109 Iss 2 pp. 155 - 172
http://dx.doi.org/10.1108/02635570910930073
• Data preparation
• Model development
• Model management
• Model deployment
http://www.sas.com/en_gb/insights/articles/analytics/Industrialize-your-analytics-today.html
CONFUSION MATRIX
A confusion matrix separates out the decisions made by the classifier, making explicit how one class is being confused for another. In this way different sorts of errors may be dealt with separately.
Provost; Fawcett. Data Science for Business: What you need to know about data mining and data-analytic thinking. Chapter 7: Decision Analytic Thinking
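To make the four cells concrete, here is a small sketch that tallies a binary confusion matrix from paired lists of actual and predicted labels. The labels are made-up illustration data.

```python
from collections import Counter

# Hypothetical true labels and classifier decisions (1 = positive, 0 = negative).
actual = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
predicted = [1, 0, 1, 0, 1, 0, 0, 1, 0, 0]

# Count each (actual, predicted) pair once.
counts = Counter(zip(actual, predicted))
tp = counts[(1, 1)]  # positives correctly flagged
fn = counts[(1, 0)]  # positives missed
fp = counts[(0, 1)]  # negatives wrongly flagged
tn = counts[(0, 0)]  # negatives correctly passed

print("            pred=1  pred=0")
print(f"actual=1    {tp:6d}  {fn:6d}")
print(f"actual=0    {fp:6d}  {tn:6d}")

accuracy = (tp + tn) / len(actual)
print("accuracy:", accuracy)
```

Keeping false positives and false negatives in separate cells is what allows each kind of error to be given its own cost, which a single accuracy figure hides.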
RECEIVER OPERATING CHARACTERISTICS (ROC) & AREA UNDER THE CURVE (AUC)
“A ROC graph is a two-dimensional plot of a classifier with false positive rate on the x axis against true positive rate on the y axis. ROC graph depicts relative trade-offs that a classifier makes between benefits (true positives) and costs (false positives).”
Provost; Fawcett. Data Science for Business. Chapter 8: Visualizing Model Performance
Area Under the Curve (AUC): area under a classifier’s curve expressed as a fraction of the unit square. Its value ranges from zero to one.
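The following sketch shows one way to trace ROC points and compute AUC by hand. It assumes every score is distinct (ties are not handled), and the scores and labels are hypothetical.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points, sweeping the threshold from high to low."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for _, label in ranked:
        if label == 1:
            tp += 1  # one more true positive: the curve moves up
        else:
            fp += 1  # one more false positive: the curve moves right
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical classifier scores with their true labels (1 = positive).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
print("AUC:", round(auc(points), 3))
```

An AUC of 0.5 corresponds to the random diagonal; the closer the curve hugs the top-left corner, the closer the area gets to one.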
CUMULATIVE RESPONSE / LIFT CURVE
• Lift: how much the line representing the model performance is lifted up over the random performance diagonal
• E.g. “our model gives a two times (or a 2X) lift”: this means that at the chosen threshold (often not mentioned), the lift curve shows that the model’s targeting is twice as good as random
Provost; Fawcett. Data Science for Business. Chapter 8: Visualizing Model Performance
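A rough sketch of how a "2X lift" figure could be computed: rank cases by model score, take the top fraction, and compare its response rate with the overall base rate. The function name, scores, and labels are hypothetical.

```python
def lift_at(scores, labels, fraction):
    """Lift among the top `fraction` of cases ranked by model score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    k = max(1, round(fraction * len(ranked)))
    top_rate = sum(label for _, label in ranked[:k]) / k  # response rate in the targeted group
    base_rate = sum(labels) / len(labels)                 # response rate under random targeting
    return top_rate / base_rate


# Hypothetical scores and outcomes (1 = responded).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

print("lift at top 50%:", lift_at(scores, labels, 0.5))
print("lift at 100%:", lift_at(scores, labels, 1.0))
```

The lift depends on the chosen fraction, which is why a quoted lift without its threshold is hard to interpret; at 100% every model has a lift of one.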
DESCRIPTIVE ANALYTICS:
UNSUPERVISED MACHINE LEARNING
Unsupervised learning
• Cluster analysis
• Factor analysis
• Self-Organizing Maps (SOMs)
k-nearest neighbors
Machine Learning
MACHINE LEARNING R / RStudio
Figure labels: workflow; configuration; data; results; scripting environment; graphical results; models
DESCRIPTIVE (UNSUPERVISED): CLUSTER ANALYSIS FOR PATTERN DETECTION
Cluster Analysis using SAS Enterprise Guide
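Cluster analysis can be illustrated without any particular tool. The sketch below uses k-means (Lloyd's algorithm) as one common clustering method; it is not the SAS Enterprise Guide workflow shown on the slide, and the points and starting centers are hypothetical.

```python
import math


def kmeans(points, centers, iterations=10):
    """Lloyd's algorithm: assign points to the nearest center, then recompute centers."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # New center = mean of assigned points (keep the old center if a cluster is empty).
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters


# Hypothetical two-dimensional observations forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, clusters = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])

for center, cluster in zip(centers, clusters):
    print(center, cluster)
```

No labels are supplied anywhere: the grouping emerges from the distances alone, which is what makes the method descriptive rather than predictive.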
BIG DATA:
BACKGROUND AND EXAMPLE
ONLINE IN 60 SECONDS…
Qmee
http://blog.qmee.com/qmee-online-in-60-seconds/
DATA ANALYTICS DRIVERS: V4C
Social and mobile; data analytics; interactive platforms; real-time systems
V4C:
• VOLUME
• VELOCITY
• VARIETY
• VARIABILITY
• COMPLEXITY
MODEL ERRORS: INHERENT RANDOMNESS
• Cases where prediction is not “deterministic”
• Bayes rate: the theoretical maximum accuracy that can be achieved for a problem
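A small worked sketch of the Bayes rate, assuming the joint distribution of a single discrete feature and a binary label is fully known. The probabilities are hypothetical; in practice this distribution is unknown and the rate can only be estimated.

```python
# Hypothetical joint probabilities P(x, y) for one discrete feature x and a label y.
joint = {
    ("low amount", "normal"): 0.50,
    ("low amount", "fraud"): 0.05,
    ("high amount", "normal"): 0.15,
    ("high amount", "fraud"): 0.30,
}


def bayes_rate(joint):
    """Best achievable accuracy: for each x, predict its most probable label."""
    best = {}
    for (x, _), p in joint.items():
        best[x] = max(best.get(x, 0.0), p)
    return sum(best.values())


rate = bayes_rate(joint)
print("Bayes rate:", round(rate, 2))
print("irreducible error:", round(1 - rate, 2))
```

With these numbers no classifier that sees only the amount can be right more than 80% of the time, however much data or modeling effort is applied.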
MODEL ERRORS: BIAS
• Bias: even with ‘Big Data’, a model will never reach the perfect accuracy of the true model
• Example: a linear regression model to predict response to an advertising campaign…
• The model is an abstraction…
• The true model is always more complex
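The point that more data does not remove bias can be sketched as follows. The "true model" is assumed here to be y = x**2 with no noise at all, and a straight line is fitted to it; both choices are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b


def mse_of_linear_fit(n):
    """Fit a line to n noise-free samples of y = x**2 on [0, 1]; return its mean squared error."""
    xs = [i / (n - 1) for i in range(n)]
    ys = [x ** 2 for x in xs]
    a, b = fit_line(xs, ys)
    return sum((a + b * x - y) ** 2 for x, y in zip(xs, ys)) / n


for n in (10, 100, 10_000):
    print(n, round(mse_of_linear_fit(n), 5))
```

The error does not shrink toward zero as n grows; it settles near 1/180 (about 0.0056), the part of the curve a straight line cannot represent.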
MODEL ERRORS: VARIANCE
• Variance: procedures with more variance tend to produce models with larger errors
• Accuracy tends to vary across training sets
• Given a finite sample set…
– Different models emerge from different samples
– Different models tend to have different accuracy
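To see how different samples yield different models, the sketch below refits the same linear model on many simulated training sets and measures how much the fitted slope moves. The data-generating process (y = 2x plus Gaussian noise), the seed, and the function names are assumptions made for illustration.

```python
import random
import statistics


def fit_slope(xs, ys):
    """Least-squares slope of y on x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)


def slope_spread(sample_size, trials=200, seed=0):
    """Standard deviation of fitted slopes across many training sets of one size."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(trials):
        xs = [rng.uniform(0, 1) for _ in range(sample_size)]
        ys = [2 * x + rng.gauss(0, 1) for x in xs]  # assumed process: y = 2x + noise
        slopes.append(fit_slope(xs, ys))
    return statistics.stdev(slopes)


small, large = slope_spread(10), slope_spread(1000)
print(f"slope spread, n=10:   {small:.3f}")
print(f"slope spread, n=1000: {large:.3f}")
```

Unlike bias, this source of error does respond to sample size: the fitted slopes scatter widely on small training sets and tighten as the sets grow.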
BALANCE: BIAS VERSUS VARIANCE
Big Data
– Complex model
– Many variables
– Low bias…
– but high variance
– Subject to overfitting
Strong models
– Tested abstraction
– Few, but significant variables
– Low variance…
– but high bias
Jno. T-62 tank in Russian service. http://www.aviation.ru/jno/Kubinka02
http://commons.wikimedia.org/wiki/File:T-62_tank_in_Russian_service_(2).jpg
Statistical Learning with Big Data
http://web.stanford.edu/~hastie/TALKS/SLBD_new.pdf
EXPLANATION:
CAUSAL MODELING
EXPLANATORY ANALYTICS
• Explanatory performance is NOT EQUAL to predictive efficacy (and vice versa): the difference between inductive and deductive methods/thinking
• This is a (sometimes heated) methodological debate amongst practitioners/academics…
• Is it really a debate, or a religious (professional/Kuhnian) dispute? Econometrics + machine learning (H. Varian)
MACHINE LEARNING AND ECONOMETRICS
• Varian, Hal R. 2014. Machine Learning and Econometrics. Stanford lecture slides:
https://web.stanford.edu/class/ee380/Abstracts/140129-slides-Machine-Learning-and-Econometrics.pdf
• Varian, Hal R. 2013. Big Data: New Tricks for Econometrics. Paper:
http://people.ischool.berkeley.edu/~hal/Papers/2013/ml.pdf
MODEL MANAGEMENT
• Ensemble learning…
• Promising: averages over many predictive cases to reduce impact of variance
• However, is CORRELATIVE, not CAUSAL
• CAUSAL data analysis requires
• Investment in data acquisition
• Similarity measurements
• Expected value calculations
• Correlation understanding
• Identifying informative variables
• Fitting equations to data
• Significance testing
• Domain knowledge
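The variance-reducing effect of ensemble averaging can be sketched with a toy simulation. It assumes, strongly, that each model's error is independent Gaussian noise around a true value of 10.0; the names, seed, and numbers are hypothetical.

```python
import random
import statistics


def noisy_model_prediction(rng):
    """One model's prediction for a case whose true value is 10.0.
    Assumption: each model errs independently with standard deviation 2.0."""
    return 10.0 + rng.gauss(0, 2.0)


def ensemble_prediction(rng, n_models):
    """Ensemble prediction: the plain average of n_models individual predictions."""
    return statistics.fmean([noisy_model_prediction(rng) for _ in range(n_models)])


rng = random.Random(42)
single = [ensemble_prediction(rng, 1) for _ in range(500)]
ensemble = [ensemble_prediction(rng, 25) for _ in range(500)]

print("spread of a single model:    ", round(statistics.stdev(single), 2))
print("spread of a 25-model ensemble:", round(statistics.stdev(ensemble), 2))
```

Real ensemble members are trained on overlapping data, so their errors are correlated and the reduction is smaller than in this idealized case. The averaging also says nothing about why the outcome occurs, which is the correlative-versus-causal caveat above.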