
Instructor: Kirk Borne, Principal Data Scientist, Booz Allen Hamilton

What is Big Data Analytics and Why Should I Care?

Crystal City, VA

September 30, 2016

Workshop in Two Parts

Part 1: Big Data Analytics

Part 2: Going for the Gold

2

Outline for Part 1: Big Data Analytics

How did we arrive at Big Data?

Domains of Data

Data Science, ML, and Analytics

The Internet of Things = hyper-Big Data

Simple Applications Demonstrated


Ever since we first explored our world…

5 Source for graphic: http://www.livescience.com/27663-seven-seas.html

… We have asked questions about everything around us.

6 Source for graphic: https://jefflynchdev.wordpress.com/tag/adobe-photoshop-lightroom-3/page/5/

The result is…

As we collect evidence (data) to answer our questions, the data leads to more questions, etc…

7

Source for graphic: http://www.airshipman.com/use-people-to-do-your-advertising/


which leads to BIG DATA!

8

Source: https://www.andertoons.com/data/cartoon/7468/the-data-weve-gathered-discussions-about-big-data-are-up-72

Exponential & Combinatorial Growth (all numbers quoted here are from circa 2014)

9

16 BILLION DVDs needed to store the internet traffic generated in a single hour, a stack 3x the height of Mount Everest.

150 BILLION Emails sent every day, up to 70% of which are spam.

33 PERCENT of children born in the United States have an online presence prior to birth.

100 HOURS of video uploaded to YouTube every minute = 16 years of content each day.

30 BILLION pieces of content shared monthly on Facebook.

300 MILLION photos uploaded to Facebook daily, nearly 20 times larger than all the photos in the Library of Congress.

2.4 BILLION tweets every 72 hours from more than 550 million active users.
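The YouTube figure above is easy to sanity-check. The following is a minimal sketch, assuming the quoted upload rate of 100 hours per minute holds around the clock and that a year has 365 days; the function name is illustrative only.

```python
# Check: 100 hours of video uploaded per minute = how many years of content per day?
HOURS_UPLOADED_PER_MINUTE = 100   # figure quoted on the slide (circa 2014)
MINUTES_PER_DAY = 24 * 60
HOURS_PER_YEAR = 24 * 365         # assumption: a 365-day year


def years_of_content_per_day(hours_per_minute: float) -> float:
    """Years of continuous video uploaded during one day."""
    hours_per_day = hours_per_minute * MINUTES_PER_DAY
    return hours_per_day / HOURS_PER_YEAR


print(round(years_of_content_per_day(HOURS_UPLOADED_PER_MINUTE), 1))  # 16.4
```

The result (about 16.4 years) agrees with the rounded "16 years of content each day" on the slide.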

Defining Big Data

We collect evidence (data) to answer our questions about the world around us … How? Why? What if?

– … and that is how we end up in a world of BIG DATA!

Big Data refers to data collections in which “everything is now being quantified and tracked” (= full-population samples of everything = The End of Demographics!)

– Examples: Social networks (Twitter, YouTube), search & online histories, web logs, financial and e-commerce transactions, environment & health monitors (wearable devices, EHRs), IoT, Astronomy,…

– Huge quantities of data are now being used everywhere.

10

Source for graphic: http://hinalockim.blogspot.com/2012/08/6th-week-cognitive-learning.html

Outline

How did we arrive at Big Data?

Domains of Data

Data Science, ML, and Analytics

The Internet of Things = hyper-Big Data

Simple Applications Demonstrated

11

Big Data in Business: Monetization is a Big Challenge

12

Source for graphic: http://www.gladwinanalytics.com/blog/big-data-business-model-maturity-index-and-the-internet-of-things-iot

Big Data in Business & Government: Analytics-driven innovation

13

Source: http://www.gao.gov/products/GAO-16-659SP

Big Data in Government: R&D strategic plan

14

Source: https://www.whitehouse.gov/sites/default/files/microsites/ostp/NSTC/bigdatardstrategicplan-nitrd_final-051916.pdf

Big Data from your body!

15 Source for graphic: https://datafloq.com/read/body-source-big-data-infographic/413

Big Data in your face!

16

Source for graphic: http://qz.com/779625/none-of-your-pixelated-or-blurred-information-will-stay-safe-on-the-internet/

Nothing pixelated (or blurred) will stay safe on the internet.

Deep Learning algorithms can discover deep hidden patterns.

Big Data in Science: Discovery at Petascale & Exascale

17 http://www.extremetech.com/extreme/124561-ibm-to-build-exascale-supercomputer-for-the-worlds-largest-million-antennae-telescope

SKA = Square Kilometer Array

joint project: Australia and South Africa

http://www.ska.gov.au/

~5 exabytes (5,000,000 Terabytes) every day!

Big Data in Environmental Monitoring

18

From Data to Information to Knowledge to Understanding


Big Data in Science: Example from Astronomy

20

LSST Construction began 2014. Survey period = 2022-2032

Deep, Wide, Fast Data to answer Big Questions about the Universe

21

LSST Key Science Drivers: Mapping the Dynamic Universe

– Complete inventory of the Solar System (Near-Earth Objects; killer asteroids???)

– Nature of Dark Energy (Cosmology; Supernovae at edge of the known Universe)

– Optical transients (10 million daily event notifications sent within 60 seconds)

– Digital Milky Way (Dark Matter; Locations and velocities of 20 billion stars!)

LSST in time and space:

– When? ~2022-2032

– Where? Cerro Pachon, Chile

Architect’s design of LSST Observatory

LSST Summary: Big Data & Data Science

22

• http://www.lsst.org

• 3-Gigapixel camera

• One 6-Gigabyte image every 20 seconds

• 20 Terabytes every night for 10 years

• Repeat images of the entire night sky every 3 nights: Celestial Cinematography

• 100-Petabyte final image data archive anticipated (all data are public!!!)

• 20-Petabyte final database catalog anticipated: ~20 trillion sources with 200+ database attributes each. This is a combinatorial explosion!

• ~10 million events per night, every night, for 10 years. Fast categorization and decisions (triage!) required.

Goal: understand our vast dynamic Universe


DEEP

WIDE

FAST

VALUE

The 4 Rewards of Big Data in all Domains

o Knowledge Discovery – Data-to-Discovery (D2D)

o Data-driven Decision Support – Data-to-Decisions (D2D)

o Big ROI (Return On Innovation) – Data-to-Dollars or Data-to-Dividends (D2D)

– Innovative Applications of sense-making from sensors and sentinels everywhere

o Data Science for Social Good – Data for Good (D4G) – follow @DataSci4Good

24

http://thinkfuture.com/

Challenges to Achieving Rewards

The 3 V’s of Big Data are not just hype – they represent really big challenges:

1. Volume (DEEP)

2. Variety (WIDE)

3. Velocity (FAST)

But… Volume is not the problem! Storage is manageable.

Data Science & Analytics (integrating and combining disparate data sources to achieve Data-to-Discovery, Data-to-Decisions, and Data-to-Dividends) are hard…

… especially on complex (diverse, high-Variety) and fast-moving (real-time, high-Velocity) data!

Focus on Value Creation through Advanced Analytics / Data Science in order to conquer these challenges.

25 Source for graphic: http://www.vitria.com/blog/Big-Data-Analytics-Challenges-Facing-All-Communications-Service-Providers/

Outline

How did we arrive at Big Data?

Domains of Data

Data Science, ML, and Analytics

The Internet of Things = hyper-Big Data

Simple Applications Demonstrated

26

Some Quick Definitions

Statistics = the practice (and science) of collecting and analyzing numerical data.

Machine Learning (ML) = mathematical algorithms that learn from experience (historical data).

Data Mining = application of ML algorithms to data.

Artificial Intelligence (AI) = application of ML algorithms to robotics and machines.

Source for graphic #1: http://www.satyavedism.com/mathematics-astrophysics/mathematics-resources

Source for graphic #2: http://blogs.sas.com/content/subconsciousmusings/2014/08/22/looking-backwards-looking-forwards-sas-data-mining-and-machine-learning/

Data Science = application of scientific method to discovery from data (including statistics, machine learning, and more: visual analytics, machine vision, computational modeling & simulation, semantics, graphs, network analysis, data indexing schemes, …).

Analytics = the products of machine learning & data science.

Machine Learning: 4 Types of Discovery (algorithms that learn from experience)

1) Class Discovery: Finding new classes of objects (population segments), events, and behaviors. This includes: learning the rules that constrain the class boundaries.

2) Correlation (Predictive and Prescriptive Power) Discovery: Finding patterns and dependencies, which reveal new governing principles or behavioral patterns (the “customer DNA”).

3) Novelty (Surprise!) Discovery: Finding new, rare, one-in-a-[million / billion / trillion] objects and events.

4) Association (or Link) Discovery: Finding unusual (improbable) co-occurring associations.
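To make the fourth type concrete, here is a minimal sketch of Association (Link) Discovery using the standard "lift" measure: how much more often two items co-occur than chance would predict. The transactions are invented for illustration, and `pair_lift` is a hypothetical helper name, not something from the workshop.

```python
from collections import Counter
from itertools import combinations


def pair_lift(baskets):
    """Return {(a, b): lift} for every item pair that co-occurs at least once."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in set(basket))
    pair_counts = Counter(
        pair
        for basket in baskets
        for pair in combinations(sorted(set(basket)), 2)
    )
    # lift = P(a and b) / (P(a) * P(b)); values well above 1 flag improbable co-occurrence.
    return {
        (a, b): (count / n) / ((item_counts[a] / n) * (item_counts[b] / n))
        for (a, b), count in pair_counts.items()
    }


# Hypothetical toy transactions, invented for illustration only.
baskets = [
    ["bread", "milk"],
    ["bread", "milk", "eggs"],
    ["milk", "eggs"],
    ["bread", "milk"],
    ["flashlight", "batteries"],
    ["batteries", "flashlight", "milk"],
]
lifts = pair_lift(baskets)
strongest = max(lifts, key=lifts.get)
print(strongest, round(lifts[strongest], 2))  # ('batteries', 'flashlight') 3.0
```

Milk appears in most baskets, so its pairings have lift near 1; the rare but consistent batteries-and-flashlight pairing stands out, which is the kind of unusual co-occurrence the slide describes.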

28

The Data Analytics Revolution

Exploiting the Value Chain: from Digital Data to Information to Knowledge to Insights (and Action)

From Sensors (Measurement & Data Collection) … Big Data (Deep, Fast, Wide)

to Sentinels (Monitoring & Alerts = Information) … Machine Learning

to Sense-making (Knowledge & Insight Discovery) … Data Science

to Cents-making (Your Applications of Data = Action!) … Analytics

… Productizing and Actionizing your Big Data

29

Data Analytics has evolved with growth in data

5 Levels of Analytics Maturity:

1) Descriptive = hindsight : what happened?

2) Diagnostic = oversight : what is happening, and why?

3) Predictive = foresight : what will happen?

– Predictive : given x, find y (needs historical training data)

4) Prescriptive = insight : how can I prescribe a better outcome?

– Prescriptive : given y, find x (needs comprehensive data set)

5) Cognitive = the “right sight” : asking the right question, at the right time, in the right context, in order to make the right decision!

– Cognitive : the “360 view”, take it all in, ask new questions!

– …to identify your “next-best move” or “next-best action”

– “It is not what you look at that matters – it’s what you see” (Henry David Thoreau)
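The "given x, find y" versus "given y, find x" distinction can be shown with a single fitted line. This is a minimal sketch, assuming a simple linear relationship and a small invented history of advertising spend versus units sold; real prescriptive analytics usually involves constraints and optimization rather than a direct inversion.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(slope, intercept, x):
    """Predictive: given x, find y."""
    return slope * x + intercept


def prescribe(slope, intercept, target_y):
    """Prescriptive: given a desired y, find the x the model says produces it."""
    return (target_y - intercept) / slope


# Hypothetical historical training data: spend (x) vs. units sold (y).
spend = [1.0, 2.0, 3.0, 4.0]
units = [12.0, 14.0, 16.0, 18.0]
slope, intercept = fit_line(spend, units)

print(predict(slope, intercept, 5.0))     # 20.0 units expected at spend 5
print(prescribe(slope, intercept, 30.0))  # 10.0 spend needed for 30 units
```

Both questions use the same model; what changes is which variable is treated as known.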


From Descriptive to Predictive to Prescriptive Analytics via Cognitive Analytics: Exploring “Data in Context” leads to new questions and new hypotheses …

34 http://www.boozallen.com/datascience

35

The Full Operational Data Analytics Spectrum

© Copyright 2016 Booz Allen Hamilton

Data Science and Analytics

Class Discovery

Correlation (Predictive / Prescriptive Power) Discovery

Surprise (anomaly) Discovery

Association (Link) Discovery

Each one can be applied at the 5 different levels of Analytics Maturity:

Descriptive → Diagnostic → Predictive → Prescriptive → Cognitive

36

The Future of Big Data Analytics and Data Science

37 http://www.boozallen.com/datascience

Machine Learning in our Lives

PREDICT: Your Purchase Preferences, Recommender Systems, Credit Scoring, Smart Phone auto-complete, …

OPTIMIZE: Your Thermostat, Your Commute Time and Routing, Personalized Learning, …

DISCOVER: Your Health Issues (wearables), Your Best Deal (Bed & Breakfast or Restaurant), …

DETECT: Your Social Sentiment, Flu Outbreaks, Credit Card Fraud, …

Machine Learning in our Work

PREDICT: Predict outcomes, events, needs, costs, risks, product demand, …

OPTIMIZE: Optimize processes, products, and people (delivery of services, supplies, personnel), …

DISCOVER: Discover insights in publications, social media, quarterly business reports, electronic records, …

DETECT: Detect fraud, anomalies in safety events, behaviors, outbreaks, data usage (HIPAA), cyber systems (data breaches), …

Data Analytics in Medicine & Health Administration

40

1. Benefits Administration improvement (“ACO = HIE + Analytics”: process mining, best practices, cost-efficiency, success metrics validation)

2. Do Not Pay initiatives (payment error / fraud analytics)

3. Beneficiary Recommendations ("Amazon-style" predictive analytics, prescriptive modeling)

4. Consumer Engagement (personalized online web experience, "marketing analytics")

5. Health Information Exchange (HIE) Exploitation (population health discovery, link analysis, ICD-10 mining)

6. Personalized Healthcare and Patient Wellness (wearables data-sharing/mining, health baselining)

7. Personalized/Precision Medicine and Care Coordination (EHR, HIE monitoring / mining)

8. Predictive Medicine (readmissions, complications, adverse interactions)

9. At-Risk Precursor Analytics (early warning signals of cancer, diabetes, heart disease, suicidal / mental health issues, ...)

10. Patient Trajectories Analysis (mining / segmentation of whole population EHR histories, pathways, outcomes, outliers)

11. Learning Health System Decision Support (advanced analytics embedded in health system data feeds)

12. What Question Should I Be Asking of My Data? (Cognitive Analytics)

© Copyright 2016 Booz Allen Hamilton – http://www.boozallen.com/datascience

Outline

How did we arrive at Big Data?

Domains of Data

Data Science, ML, and Analytics

The Internet of Things = hyper-Big Data

Simple Applications Demonstrated

41

Data Science: Applications and Use Cases are everywhere…

Smart Apps (Find best price; real-time travel adjustments; type-ahead texting)

Predictive Retail (Dynamic Pricing, Smart Supply Chain, Precision Demand Forecasting)

Precision Marketing (SegOne, Personalized Real-time Ad Campaigns for Next Best Offer)

Smart Highways (Real-time intelligence among vehicles, weather, roads, repairs)

Precision Traffic (Self-driving & Self-parking Connected Cars)

Smart Cities (Growth, Dynamic Street-lighting, Smart Energy Usage)

Predictive Law Enforcement (Predictive, Prescriptive personnel & resource placements)

Smart Healthcare (Wearables, Personalized Medicine, Patient/Provider Monitoring)

Invisibles (under-the-skin smart sensors that measure, learn, respond) = The Internet of Emotions!

Personalized Online Education (Dynamic learning, Gamification, Real-time interventions)

Precision Forests, Farms, Vineyards,… (Data-driven Planning, Nurturing, Harvesting)

Fintech / Banks / Insurance (Fast Risk analysis, Fraud detection, Personalized services)

Smart Organizations (Talent Placement, Employee Retention, Workforce Deployment, Process Mining for Efficiencies, Workflow recommender engines)

Predictive Machines (Early Warning, Prescriptive Maintenance & Obsolescence, IoT, Industrial IoT)

The XYZ of Data Science: Intelligence at the edge of the network (Edge Analytics at the point of data collection)

Smart X

– Smart Cities

– Smart Highways

– Smart Supply Chain

Precision Y

– Precision Medicine

– Precision Farming

– Precision Pricing

Personalized Z

– Personalized Health

– Personalized Learning

– Personalized Shopping Experience

43

http://www.loopcayman.com/content/if-smart-cities-are-next-big-thing-what-about-smart-regions

Internet of Things

https://www.nsf.gov/news/news_images.jsp?cntn_id=122028

Everything Interconnected

https://www.nsf.gov/news/news_images.jsp?cntn_id=122028

The Internet of Things (IoT) is an interconnected universe of Dynamic Data-Driven Application Systems (DDDAS)

https://www.nsf.gov/news/news_images.jsp?cntn_id=122028

Drive Big Benefits with Big Data Analytics Triage

General example of Data Analytics Triage in IoT: Event Mining in Dynamic Big Data Collections for Actionable Intelligence:

– Behavior modeling (anomaly & trend detection) and ad hoc inquiry for Discovery

– Identifying, characterizing, & responding to events for data-driven Decisions

– Deciding which events need immediate investigation and/or intervention = Action!

Many other examples:

– Web user engagement & recommendations (from web analytics data)

– Customer churn early warning (from 360-view customer data)

– Predictive Maintenance alerts (from machine / engine sensors)

– Infrastructure Monitoring alerts (from ubiquitous sensors)

– Supply chain monitoring (from manufacturing & shipping sensors)

– Cybersecurity alerts (from network logs)

– Preventive Fraud alerts (from financial applications)

– Health alerts (from EHRs and national health systems)

– Tsunami alerts (from geo sensors everywhere)

– Social event alerts or early warnings (from social media)
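The triage idea above (decide which events need immediate action, which need investigation, and which can simply be logged) can be sketched with a plain z-score rule. This is a minimal illustration, assuming a numeric sensor with a known "normal" baseline; the thresholds, labels, and engine-temperature values are hypothetical.

```python
from statistics import mean, stdev


def triage(baseline, readings, investigate_at=2.0, act_at=4.0):
    """Label each new reading by its distance from the baseline, in standard deviations."""
    mu, sigma = mean(baseline), stdev(baseline)
    labels = []
    for value in readings:
        z = abs(value - mu) / sigma
        if z >= act_at:
            labels.append((value, "act now"))
        elif z >= investigate_at:
            labels.append((value, "investigate"))
        else:
            labels.append((value, "log only"))
    return labels


# Hypothetical engine temperatures observed during normal operation.
baseline = [70.0, 71.0, 69.0, 70.5, 69.5, 70.0, 71.0, 69.0]
for value, label in triage(baseline, [70.2, 72.5, 80.0]):
    print(value, "->", label)
```

With these numbers the three new readings land in the three different bins, which is the point of triage: most events are logged, and only a few are escalated.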

47

Prescriptive Risk Mitigation

Infusing Analytics Capability into your organization

48 © Copyright 2016 Booz Allen Hamilton

Booz Allen’s approach to helping organizations drive competitive advantage through data analytics

Activities: Enrich • Integrate and Transform Data

Methods: Descriptive Statistics • Filtering • Aggregation

Activities: Reveal trends • Identify Correlations • Learn Patterns

Methods: Unsupervised Learning • Clustering • Outlier Detection

Activities: Classify Signals • Predict Risks • Forecast Resources

Methods: Random Forest • Neural Networks • Bayesian Analysis • Collaborative Filtering

Activities: Optimize Resources • Simulate Decision Outcomes

Methods: Genetic Algorithms • Integer Programming • Non-Linear Programming • Discrete Event Simulation

Acquisition, aggregation and enrichment of information from multiple entry points will help create a holistic view that can enhance operations, reduce risk, provide powerful insight, and create value.

Enables Effective Operations and Decision-Making

• Allows for accurate analysis of trends across the organization against defined KPI’s

• Supports strategic C-Suite decision making

• Reveals operational risks and potential bottlenecks in real-time

• Supports critical information infrastructure protection efforts by early detection of vulnerabilities

Products: Reports | Dashboards | Mitigations

360° Data Acquisition

• Business Operations and Performance Data

• Logs: Systems, Customers,…

• Reports, e-Docs, and Manuals

• Open Data

Outline

How did we arrive at Big Data?

Domains of Data

Data Science, ML, and Analytics

The Internet of Things = hyper-Big Data

Simple Applications Demonstrated

49

Mars Rovers (metaphor for general use case)

50


• Mars Rover = intelligent data-gatherer, mobile data mining agent, and autonomous decision-support system:

– Gathers data (in situ) for remote sensors

– Performs intelligent (autonomous, cognitive) data mining operations

• Class Discovery

• Correlation (Predictive and Prescriptive Power) Discovery

• Novelty Discovery

• Association Discovery

– Enacts on-board Intelligent Data Understanding & Decision Support

• “Stay here and do more, or move elsewhere”

• “Follow trend to more interesting, lucrative, and productive location”

• “Send results immediately, or store for later analysis”

From Sensors to Sentinels to Sense

52

• New knowledge and insights are acquired by monitoring and mining actionable data from all digital inputs.

–Sensors!

• Alerts are triggered autonomously, without intervention (when it is permitted), applying machine learning and actionable business decision rules for pattern detection and diagnosis.

–Sentinels! (embedded machine learning / data science algorithms)

• “Smart Sensors” (powered by Machine Learning-enabled sentinels) deliver actionable intelligence.

–Sense!

(applies to any application domain with streaming data from sensors)

Dynamic Data-Driven Application Systems (DDDAS)

4 steps from data to action = MIPS:

– Measurement – Inference – Prediction – Steering

This applies to any Network of Sensors:

– Web user interactions & actions (web analytics data), Cyber network usage logs, Social network sentiment, Machine logs (of any kind), Manufacturing sensors, Health & Epidemic monitoring systems, Financial transactions, National Security, Utilities and Energy, Remote Sensing, Tsunami warnings, Weather/Climate events, Astronomical sky events, …

– IoT (the Internet of Things) and M2M (Machine-to-Machine): e.g., connected cars, manufacturing plants, transportation systems, locomotive and jet engines, power grid, “smart home”, “smart cities”, “smart farms”,…

Machine Learning enables the “IP” part of MIPS:

– Pattern (Segment) Discovery

– Correlation (Trend) Discovery

– Novelty (Anomaly) Discovery

– Association (Link) Discovery

http://dddas.org

Alert & Response systems:

• Actionable insights from streaming business data

• Automation of any data-driven operational system
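The MIPS cycle (Measurement, Inference, Prediction, Steering) can be sketched as a small loop over a sensor stream. This is a toy illustration under stated assumptions: inference is a 3-point moving average, prediction is a one-step linear extrapolation, and the steering rule, limit, and readings are all hypothetical. In the systems the slide describes, machine learning models would fill the inference and prediction steps.

```python
def measure(sensor_stream):
    """Measurement: pull the next raw reading from the sensor."""
    return next(sensor_stream)


def infer(history, reading, window=3):
    """Inference: record the reading and estimate the current level (moving average)."""
    history.append(reading)
    recent = history[-window:]
    return sum(recent) / len(recent)


def predict_next(history):
    """Prediction: extrapolate one step ahead from the last two readings."""
    if len(history) < 2:
        return history[-1]
    return history[-1] + (history[-1] - history[-2])


def steer(predicted, limit):
    """Steering: act before the limit is actually crossed."""
    return "throttle down" if predicted > limit else "hold"


def mips_loop(readings, limit):
    stream, history, actions = iter(readings), [], []
    for _ in range(len(readings)):
        reading = measure(stream)
        level = infer(history, reading)
        forecast = predict_next(history)
        actions.append((reading, round(level, 2), forecast, steer(forecast, limit)))
    return actions


# Hypothetical rising sensor readings with an operating limit of 65.
for step in mips_loop([50, 52, 55, 59, 64], limit=65):
    print(step)
```

In this run the last reading (64) is still under the limit, but the forecast (69) is not, so the loop steers before the limit is breached.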


…but the greatest of V’s is Variety

Source for graphic: http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/BrunnerDPS.pdf

The discovery and separation of classes improves when a sufficient number of “correct” features are available for exploration:

(a) 2 classes are discovered and become separable:

(b) One trend line becomes 2 clusters:

Feature Selection and Projection

Feature Selection is important to disambiguate different classes. More importantly, Class Discovery depends on selecting the right features!

57

Feature Selection and Model Bias: choosing features in the dark

I picked out two socks from my sock drawer this morning!

It was still dark, but that shouldn’t matter, right? After all, they are the same size … THE SAME ?!?

The Era of Big Data represents the END OF DEMOGRAPHICS (i.e., our models should no longer be based on and biased by a limited selection of attributes and features)

58

59

Insufficient Variety: multiple classes are not distinguishable using this one feature

Sufficient Variety: two classes are discovered using this new feature
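The contrast between insufficient and sufficient variety can be reproduced with two features on a toy data set. This is a minimal sketch with invented customer records: on the first feature (age) the two groups overlap completely, while on the second (visits per week) they separate cleanly. The group labels are shown only to make the point visible; in real class discovery they would not be known in advance.

```python
def separation(class_a, class_b):
    """Gap between the two classes' value ranges on one feature (negative = overlap)."""
    lo_a, hi_a = min(class_a), max(class_a)
    lo_b, hi_b = min(class_b), max(class_b)
    return max(lo_a, lo_b) - min(hi_a, hi_b)


# Hypothetical customers: (age, visits_per_week).
group_a = [(25, 1), (40, 2), (55, 1)]
group_b = [(27, 9), (42, 8), (53, 10)]

ages_a, visits_a = zip(*group_a)
ages_b, visits_b = zip(*group_b)

print("age gap:", separation(ages_a, ages_b))          # -26 (ranges overlap)
print("visits gap:", separation(visits_a, visits_b))   # 6 (clean separation)
```

Looking only at age, the data appears to be one population; adding the second feature reveals two.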

60

Another example of class discovery in a data set: by exploring high-variety (high-dimension) data

The separation and discovery of classes improves when a sufficient number of “correct” features are available for exploration:

61

Source for graphic: http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/BrunnerDPS.pdf

Clustering for Persona Discovery and Customer Personalization

Exploiting the 3rd V of Big Data

(Data Exploration and Data Exploitation)

1. Volume

2. Velocity

3. Variety

62

Digital Marketing Analytics Evolution: From Demographics to Personalization to Hyper-personalization

63

http://www.webtwit.com/digital-marketing-company-india.html

360 Customer View in Digital Marketing

64

Clustering = Class / Segment Discovery

Clustering = the process of partitioning a set of data into subsets (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging to that same cluster than to the data elements belonging to other clusters.

= Grouping together similar items, and separating dissimilar items

= Identifying similar characteristics, patterns, or behaviors among subsets of the data elements.

Challenges:

#1) No prior knowledge of the number of clusters.

#2) No prior knowledge of semantic meaning of the clusters.

#3) Different clusters are possible from the same data set!

#4) Selecting different features can lead to different clusters.

65

Types of Clustering

In general terms, there are two approaches to clustering:

– Partitional – One set of clusters is created (e.g., K-Means clustering – choose K, the number of clusters).

– Hierarchical – Nested sets of clusters are created sequentially.
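As an illustration of the partitional approach, here is a minimal K-Means sketch (Lloyd's algorithm) in plain Python. The data points and the starting centroids are invented; K is simply the number of starting centroids supplied, which reflects the "choose K" step on the slide. Production work would normally use a library implementation with better initialization.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain K-Means on tuples of numbers; K = len(centroids)."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(members) for coord in zip(*members)) if members else c
            for members, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical 2-D data with two visible groups; K = 2.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.6), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[(1.0, 1.0), (8.0, 8.0)])
print(centroids)
print(clusters)
```

Different starting centroids or a different K can yield different clusters from the same data, which is challenge #3 in the clustering definition.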

66

Example of Hierarchical Clustering


Starting with (a), then going to (e): Bottom-up, Agglomerative Clustering

Starting with (e), then going to (a): Top-down, Divisive Clustering


The “Google Maps” view for your Customer Space

https://www.researchgate.net/figure/273456906_fig3_Figure-4-Example-of-hierarchical-clustering-clusters-are-consecutively-merged-with-the

Hierarchical Clustering Approaches

Clusters are created at multiple levels – creating a new set of clusters at each level.

There are 2 types of hierarchical clustering:

– Agglomerative Clustering

Bottom-Up

Initially, each item is in its own cluster.

Then, clusters are merged together iteratively ...

– ... based upon similarity of data items.

– Divisive Clustering

Top-Down

Initially, all items are in one cluster.

Then, large clusters are successively divided ...

– ... based upon distance between data items.
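The bottom-up (agglomerative) procedure described above can be sketched in a few lines for one-dimensional data. This is a minimal illustration, assuming single-linkage distance (the gap between the closest members of two clusters) and invented values; the function name and data are hypothetical.

```python
def agglomerate(values, target_clusters=1):
    """Bottom-up single-linkage clustering of numbers; returns the clusters at each level."""
    clusters = [[v] for v in sorted(values)]          # each item starts in its own cluster
    levels = [[list(c) for c in clusters]]
    while len(clusters) > target_clusters:
        # For sorted 1-D data the closest pair of clusters is always adjacent.
        gaps = [clusters[i + 1][0] - clusters[i][-1] for i in range(len(clusters) - 1)]
        i = gaps.index(min(gaps))
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]   # merge the closest pair
        levels.append([list(c) for c in clusters])
    return levels


# Hypothetical customer spend values.
for level in agglomerate([1, 2, 6, 7, 20]):
    print(level)
```

Each printed level is one "zoom level" of the hierarchy, from five singleton clusters down to one cluster containing everything. A divisive method would produce a hierarchy from the top down instead.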

68

Segmentation of One = ‘SegOne’ Marketing = Personalization

Marketing Campaign Segments = Customer Personas

Digital Marketing: your “Mars Rover” in a box – 1

Mining multi-channel big data streams (across your organization)

o Class Discovery

o Correlation (Predictive and Prescriptive Power) Discovery

o Novelty Discovery

o Association Discovery

Hierarchical Segmentation for Personalization (“SegOne Marketing”)

Decision Automation in a rich content (Big Data) environment

69

Digital Marketing: your “Mars Rover” in a box – 2

Your own “Smart Sentinel (Mars Rover)”

– Your business rules determine the decision points, alerts, and responses (IF-This-Then-That = IFTTT).

– Move beyond historical hindsight and oversight (Descriptive & Diagnostic Analytics)

– Apply insight and foresight (Predictive & Prescriptive Analytics)

– Achieve right sight for your next-best move (Cognitive Analytics)

the 360 view enables the right question, right action, for the right customer, at the right place, at the right time, in the right context.
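The IF-This-Then-That idea of business rules driving alerts and responses can be sketched as a small rule table. Everything here (rule names, thresholds, customer fields, responses) is hypothetical and invented for illustration.

```python
# Each rule is (name, condition, response); all are hypothetical business rules.
RULES = [
    ("churn risk", lambda c: c["days_since_visit"] > 30, "send win-back offer"),
    ("big spender", lambda c: c["monthly_spend"] >= 500, "offer loyalty upgrade"),
    (
        "cart abandoned",
        lambda c: c["items_in_cart"] > 0 and c["days_since_visit"] >= 2,
        "send cart reminder",
    ),
]


def respond(customer, rules=RULES):
    """IF the condition holds THEN emit the response; collect every triggered response."""
    return [response for _, condition, response in rules if condition(customer)]


customer = {"days_since_visit": 45, "monthly_spend": 120, "items_in_cart": 2}
print(respond(customer))  # ['send win-back offer', 'send cart reminder']
```

Keeping rules as data rather than hard-coded branches makes it straightforward for the business side to add or adjust decision points.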

70

Data Science improves your odds in the fundamental business gambit: RISK versus REWARD

71

http://www.telegraph.co.uk/news/worldnews/europe/russia/10061780/Russian-convicts-beat-Americans-in-cyber-chess-battle.html

Are you ready to reap rewards (the 3 D2D’s) from Hyper-Big Data through Data Science?

Learning from data (Data Science):

– Clustering (= New Class discovery, Segmentation)

– Correlation, Trend, Association, & Link discovery

– Classification, Diagnosis (Predictive power discovery)

– Outlier, Anomaly, Novelty detection (Surprise discovery)

… for business value (the 3 D2D’s):

– Data-to-Discoveries

– Data-to-Decisions

– Data-to-Dividends (big ROI = Return on Innovation)

72

http://thinkfuture.com/

http://www.hadoop360.com/blog/iot-101-everything-you-need-to-know-to-start-your-iot-project

http://www.dataev.com/it-experts-blog/bid/297713/The-Big-Data-Challenges-of-a-Biotechnology-Startup-Company

SUMMARY – Part 1

Big Data is not about “Big” or “Data”

Big Data is a concept, focused on:

1) Data Science Discovery = Data-to-Discovery

2) Analytics Solutions = Data-to-Decisions

3) Value Creation = Data-to-Dividends (Data-to-Dollars)

… The Right ROI in a Big Data World = Return On Innovation

Machine Learning and Data Science are about:

a) Digital data transformations from Sensors to Sentinels to Sense-Making; and

b) Insights through Predictive & Prescriptive Power Discovery and Cognitive Exploration in DEEP, WIDE, FAST data!

73

http://www.boozallen.com/datascience @KirkDBorne

Part 2 – Going for the Gold

Steps to Cognitive Analytics

The Data Science Bowl (data for good)

Dare to Change the World

74


Simple Example of Descriptive, Predictive, Prescriptive, and Cognitive Analytics

Trend Lines in data: Descriptive!

Warning: it is tempting to over-fit every wiggle in the data!

[Plot: melting point vs. boiling point for the 92 naturally occurring elements; all measurements in kelvins]

This is a better fit to the trend line… for use in Predictive & Prescriptive analytics!

Trend Lines and Outliers:

Sometimes we are tempted to think that outliers are just noise.

Where is the real discovery?

Add some context to the data!

…that diagonal line in the plot (where melting point = boiling point) provides some context (your expectations)!

Why is that point below the line?

There’s the Real Discovery!

Arsenic!

Melts @ 1089 K

Boils @ 889 K

Cognitive Surprise Discovery

(outlier / anomaly / deviation detection)

Knowing the right question to ask!
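The "right question" in this example is whether any element sits below the diagonal, that is, has a boiling point lower than its melting point. A minimal sketch of that check follows. The arsenic values are the ones quoted on the slide; the other rows are approximate reference values added here for illustration, not data from the workshop.

```python
# (melting point in K, boiling point in K).
# Arsenic: values quoted on the slide. Other rows: approximate reference values (assumption).
ELEMENTS = {
    "Oxygen": (54, 90),
    "Mercury": (234, 630),
    "Lead": (601, 2022),
    "Iron": (1811, 3134),
    "Arsenic": (1089, 889),
}


def below_diagonal(elements):
    """Elements whose quoted boiling point is lower than their melting point."""
    return [name for name, (melt, boil) in elements.items() if boil < melt]


print(below_diagonal(ELEMENTS))  # ['Arsenic']
```

A trend-line fit would treat this point as noise; comparing each point against the expectation (boiling point above melting point) is what surfaces it as a discovery.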

Part 2 – Going for the Gold

Steps to Cognitive Analytics

The Data Science Bowl (data for good)

Dare to Change the World

86


The Catalyst

• Booz Allen’s Data Science Practice

• Our Passion for Data Science

• Lack of a National Data Science Event

http://www.boozallen.com/datascience/

http://www.datasciencebowl.com/

(www.DataScienceBowl.com)

Citizen Data Science!

About Kaggle

● World’s largest online data science competition community

● Over 500,000 members across ~200 countries

● Community uses diverse backgrounds to solve some of the most complex data science problems in the world

● Extremely strong brand within the data science community

“We and the broader data science community share a common passion, culture, and vision for using data science for social good.”

(www.DataScienceBowl.com)

Last year’s Grand Challenge:

$175,000 prizes (provided by Booz Allen)

Assess ocean health at a speed and scale

that were previously impossible.

(www.DataScienceBowl.com)

Services provided by Plankton:

• Provide food for humans and marine animals

• Produce oxygen (phytoplankton)

• Remove CO2 from the atmosphere

• Contribute to global biodiversity

• Provide biomedical products

• Major source of nutrients for indigenous populations

Assess Ocean Health by classifying 121 Classes of Plankton in >160K images

(www.DataScienceBowl.com)

Last year’s winning approach

● Read all about it here: http://benanne.github.io/2015/03/17/plankton.html

● Deep Learning with convolutional neural networks

● Average accuracy of 81% across all 121 plankton classes

● Code available at https://github.com/benanne/kaggle-ndsb

● More than 1000 competing teams

● More than 15,000 submissions

● Recap: http://www.datasciencebowl.com/recap-first-annual-data-science-bowl/

We did it again this year with a $200K heart health challenge!

Data providers and partners: Drs. Michael Hansen and Andrew Arai, of the NIH National Heart, Lung, and Blood Institute (NHLBI); and the Children’s National Medical Center.

Other partners include: NVIDIA; American College of Cardiology; The Children’s Inn at NIH; FNIH (Foundation for the NIH); MedStar Institute for Innovation; and more.

The Challenge: improve diagnosis of heart disease through faster, more accurate measurement of ejection fraction (end-systolic and end-diastolic volumes) in cardiac MRI data.

The Data: time-series of MRI scans from over 1000 patients.

(www.DataScienceBowl.com)
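Ejection fraction, the quantity named in the challenge statement, is computed from the two volumes mentioned there: it is the share of the end-diastolic volume that the heart ejects in one beat. The sketch below uses the standard formula; the example volumes are hypothetical, and the competition itself was about estimating the volumes from MRI images, which this snippet does not attempt.

```python
def ejection_fraction(end_diastolic_ml: float, end_systolic_ml: float) -> float:
    """Percent of the end-diastolic volume ejected in one beat: 100 * (EDV - ESV) / EDV."""
    if end_diastolic_ml <= 0:
        raise ValueError("end-diastolic volume must be positive")
    stroke_volume = end_diastolic_ml - end_systolic_ml
    return 100.0 * stroke_volume / end_diastolic_ml


# Hypothetical volumes in millilitres, for illustration only.
print(round(ejection_fraction(120.0, 50.0), 1))  # 58.3
```

Because the result depends directly on both volumes, errors in either volume estimate carry straight through to the ejection fraction.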

It was not just about improving Cardio Imaging Analytics. It’s about Reed’s story: One in 100 children are born with congenital heart defects!

(www.DataScienceBowl.com)

Results: Volume Predictions

Data Science Bowl co-winner Tencia Lee visits NIH NHLBI to discuss winning algorithm

http://www.datasciencebowl.com/leading-and-winning-team-submissions-analysis/

(www.DataScienceBowl.com)

2016 Format & Logistics

● Web-based competition (www.DataScienceBowl.com)

● Competition Period: 14 December 2015 through 14 March 2016

● Models were quantitatively scored (i.e., no subjective judging panel)

● We are now seeking ideas for the 2017 Data Science Bowl #3 Challenge:

http://www.datasciencebowl.com

1 GRAND CHALLENGE

90 DAYS = $200,000 PRIZES

1st place: $125,000

2nd place: $50,000

3rd place: $25,000

NVIDIA also provided complimentary GPU Technology Conference passes to top 3 teams

(www.DataScienceBowl.com)

Part 2 – Going for the Gold

Steps to Cognitive Analytics

The Data Science Bowl (data for good)

Dare to Change the World

100

Big Data + the IoT + Citizen Data Scientists = Partners in Sustainability

The Internet of Things (IoT):

• Knowing the knowable via deep, wide, and fast data from ubiquitous sensors!

Big Data:

• In the Big Data era, Everything is Quantified and Tracked!

• Examples: Social Networks; Population & Personal Health; Smart Cities & Highways; Retail Analytics; Cybersecurity; IoT = Internet of Things

Sustainable Development Goals

17 SDGs are KPIs for the World! (currently, the SDGs have 229 key performance indicators)

Environmental Monitoring with IoT data

Check out and participate in the EPA Smart City Air Quality Challenge: https://www.epa.gov/innovation/epa-challenges-prizes

EPA is challenging communities to deploy hundreds of air quality sensors and to make the data public!

$100,000 in prizes

Submissions due October 28, 2016

@KirkDBorne

@DataSci4Good

@BoozAllen

Are you ready to change the world with Big Data Analytics?

LISTEN

READ www.boozallen.com/datascience

The Field Guide to Data Science

Building a Data Science Capability

Data Science Answers on Demand

10 Signs of Data Science Maturity

© Copyright 2016 Booz Allen Hamilton

Booz | Allen | Hamilton

PARTICIPATE datasciencebowl.com

Thank you!

Contact information:

kirk.borne@gmail.com

@KirkDBorne

http://www.boozallen.com/datascience
