Big data beyond the hype may 2014

Big Data Beyond the Hype: Trends, Pi6alls and Building Competency

May 2014 1 5/29/14

Agenda

•  Why Big Data •  Technology Capabili<es •  Organiza<onal Evolu<on •  Case Study •  Conclusions

5/29/14 2

Big Data

5/29/14 3

•  Data sets so large and complex that they become awkward to work with using standard tools and techniques

•  Google, Quantcast, Yahoo, LinkedIn… customer innova<on in open source

Patterns from Delivering Big Data at Scale

eCommerce 2 of Global Top 5

Retail 2 of Global Top 5

Social Networking Global #1

Banking 4 of Global Top 10

Credit Issuer 2 of Global Top 5

Financial Data Services 2 of Global Top 5

Financial Exchanges Global #2

Brokerage & Mutual Funds 2 of Global Top 5

Asset Management Global #1

Semiconductor 2 of Global Top 5

Data Storage Devices 3 of Global Top 5

Disk Drive Manufacturing Global #1

TelecommunicaJons 2 of Global Top 5

Media & AdverJsing 2 of Global Top 4

Internet TransacJon Security Global #1

5/29/14 4

Manufacturing Analytics

1 Improve Yield -‐ reduced scrap, faster <me to market

2 Ease data access – quick search access to all relevant data for all history in one place, not samples, summary or recent data

3 Process Efficiency – iden<fy redundant tests, op<mize throughput

4 Component Analysis – proac<ve discovery of component problems, compa<bility detec<on

5 Service Revenue GeneraJon -‐ Leverage customer support to generate new revenues by increasing customer reten<on, support renewals, and upgrading customers to new products and licenses.

5 5/29/14

Product Analytics

1 Issue ResoluJon -‐ Faster resolu<on of issues through deep analysis of integrated device log, product, and configura<on data across the install base.

2 Support Metrics -‐ Gain new insights into dimensions impac<ng service metrics by combining product use technical data with business data. E.g., number of <ckets resolved, length of calls, resource alloca<on, customer sa<sfac<on.

3 ProacJve Service -‐ Iden<fy and address poten<al issues rather than wai<ng to react to them aXer they happen. I.e., intelligence for condi<on-‐based maintenance.

4 Customer Self-‐Service -‐ Drive higher rates of customer self-‐service by providing customers access to tools and resources to answer their own ques<ons and solve their own issues by analyzing past customer ac<vity and integra<ng technical details from customer systems into self-‐service portals.

5 Service Revenue GeneraJon -‐ Leverage customer support to generate new revenues by increasing customer reten<on, support renewals, and upgrading customers to new products and licenses.

6 5/29/14

Clickstream Analytics

1 Timely Processing -‐ Data volumes are increasing clogging data warehouse, slow and out of date.

2 Cost Efficiency & Scale -‐ Data volumes are increasing clogging data warehouse, slow and out of date.

3 Flexible Tagging – Analyze site without having to push data to tags, classify ac<vity based on sec<ons, URLs more flexibly than tags.

4 Converged AnalyJcs -‐ Understand and improve web, mobile, cross-‐channel customer ac<vity. Siloed apps costly & inflexible – e.g., tags require feeding data about user demographics, segments. Mash up interac<ons with data like geographic, demographic, social, call center, purchases.

5 Deeper ExploraJon -‐ Analyze sequences of events in customer journeys. Deep drill down (how do users with the latest iPhone and iOS compare to latest Samsung and Android). Cohort analysis -‐ those who buy books vs spor<ng goods.

6 PredicJve Models – next best offer recommenda<ons, personaliza<on, customer value, microsegmenta<on, ad aeribu<on…

7 5/29/14

Agenda

•  Why Big Data •  Technology CapabiliJes •  Organiza<onal Evolu<on •  Case Study •  Conclusions

5/29/14 8

Big Data Reference Architecture

5/29/14 9

•  Community Open Source •  Over $1 Billion invested in startups, all major sw companies embrace

•  3 replicas for HA, 50 PB+ scale

Apache Hadoop

5/29/14 10

Cluster

Slaves

Client Servers

✲ Hive, Pig, ...✲ cron+bash, Azkaban, …

Sqoop, Scribe, …Monitoring, Management

...

Secondary Master Server

Secondary Name Node

Primary Master Server

✲ Job Tracker

Name Node

Slave Server✲ Task Tracker

Data Node

DiskDiskDiskDiskDiskDiskDiskDisk


Data Node



Data Node


Copyright Think Big Analytics 2014. All rights reserved. 11

�  Support for many processing engines including emerging Predictive Analytics libraries

�  Common cluster with local data access in HDFS

YARN - Yet Another Resource Negotiator


�  Focus on technologies for prescriptive analytics -  Getting insights from your data

�  To get there you need -  data ingestion (Flume, Kafka, Sqoop, ...) -  data transformation (Pig, MapReduce, Hive, Cascading, …)

�  Beyond prescriptive, Machine Learned predictive analytics is interesting

Narrowing In


�  Scala-based more general distributed processing with long-lived memory caching

�  First released 2010, many contributors �  Scala, Java, Python APIs �  Largest production cluster to date is 80 nodes at Yahoo �  YARN Support �  Subprojects

-  Shark – Hive on Spark -  MLib – machine learning -  Spark Streaming -  GraphX

�  Pros: faster iterative analysis, growing distribution support �  Cons: less mature, Scala API foundation, smaller community

Apache Spark


�  DAG Processing of never ending streams of data -  First released 2011 -  Used at Twitter plus dozens of other companies -  Reliable - At Least Once semantics -  Think MapReduce for data streams -  Java / Clojure based -  Bolts in Java and ‘Shell Bolts’ -  Not a queue, but usually reads from a queue. -  YARN integration

�  Pros: community, HW support, low latency, high throughput �  Cons: static topologies & cluster sizing, Nimbus – SPOF, state

management

Apache Storm

Platform Trends

•  Hadoop will eat Big Data Analy<cs •  Hadoop YARN to run diverse applica<ons •  Real<me latency: Storm, Samza, HBase, Cassandra, Elas<csearch… •  Management: Ambari, Pepperdata…

•  Security: Knox, Sentry, XA Secure… •  Cloud: AWS, Qubole, Altascale, Info Chimps…

15 5/29/14

Analytics Trends

•  Real<me query engines: Hive/Tez, Shark, Impala, Presto… •  Machine Learning: Spark, MLib, Mahout, Orryx…

•  Metadata: HCatalog •  Analyst Tools: Tableau, Alpine Data Labs, Plaoora, Zoomdata…

•  Sensemaking: Data Tamer, Trifacta, Paxta…

16 5/29/14

Agenda

•  Why Big Data •  Technology Capabili<es •  OrganizaJonal EvoluJon •  Case Study •  Conclusions

5/29/14 17

How To Get the Right Answers

Data

Weak intersection between Data, Systems and Business often means project failure

Current State Improved State Collaborative integration of Analytics with Big Data capabilities for success.

Data

Weak Area with diminished influence

Cross-functional collaboration

Data-driven business alignment

5/29/14 18

Analytics and Data Science

Analy<cs is the umbrella term for discovery and communica<on of signal in data. •  Intelligence and ReporEng: Governed insights for general audiences by providing summaries and visualiza<ons over well understood data

•  DescripEve AnalyEcs: Interpretable data stories to domain consumers (e.g. business, sales, etc.) using exploratory data analysis and descrip<ve models (e.g. clustering) over well understood data

•  PredicEve AnalyEcs: Predicts future state or assesses impact of changing parameters for domain consumers (e.g. business, sales, etc.) using sta<s<cal modeling over well understood data

•  Data Science: Uncovers new or missing signals and rela<onships in data to be leveraged by other analy<cs. Supported with a mix of exploratory data analysis and sta<s<cal modeling over unexplored data or new combina<ons of known data. New capability driven by Big Data, currently overhyped and overused.

End-‐goal: effecEve and repeatable integraEon of analyEcs into business processes.

5/29/14 19

Basic Repor<ng

Big Data Repor<ng

Data Science Hub

Business Analy<cs

Data Driven Organiza<on

Analy<c Integra<on

High

Low

Big Data Capability Limited Full

Analytics Capability Model

5/29/14 20

Data Science Hub

Data Driven Organiza<on

Analy<c Integra<on

High

Low

Big Data Capability Limited Full

•  Redundant infrastructure •  Data locked down w/in product teams •  Analy<c innova<ons not shared

•  Share data internally •  Support joining of data •  Make analy<cs capabili<es

accessible to all business lines.

Client Example – Internet Infrastructure Firm

5/29/14 21

Agenda

•  Why Big Data •  Technology Capabili<es •  Organiza<onal Evolu<on •  Case Study •  Conclusions

5/29/14 22

•  “Reduce Data Search Par<es” •  Stop playing “Where’s Waldo with your Data”

•  “ I know I have that data….. somewhere?” •  Data Aggrega<on to a Common Plaoorm with common access tools

•  Improve Yields by Accessing More Data in a More Timely Manner

•  End to end visibility for every test, every diagnos<c and all info from all components of a product (internal and external)

•  Speed up yield improvement ramp up on new products •  Improve steady state yield on exis<ng products.

Use Cases

23 5/29/14

Big Data Platform

Devices

Components

Assemblies

…

Field Data

Supplier

All raw parametric, logistic, vintage, data

ERP/DW’s

Data Sources

End-‐to-‐End Integrated

Data

Consumers Ad hoc Analysis

BI Tools

Op<mize/Reduce Tes<ng

Data Marts

Batch AnalyJcs

Parallelized batch analytics

App-Specific Views

New High-Value Parameters

raw extracts

Enriched data

Proac<ve Issue Iden<fica<on

Failure Screen Tests

Customer FA via Field Data

Predic<ve Analy<cs Tools

…other ..

5/29/14 24

Highlights

•  Enterprise-‐wide data access for <mely analy<cs and insights •  Founda<on for large scale proac<ve analy<cs

Legacy BDP

Reten<on 3-‐6 months scaeered DBs then tape archive

All data online in common data lake for 3+ years

Coverage Summaries, samples for several data sets

All parametric data captured in raw and integrated form

Analysis Reac<ve resul<ng in missed improvement opportuni<es

In daily opera<ons and on larger data sets to support proac<ve improvements

25 5/29/14

•  Metrics: •  Collec<ng >2M manufacturing/tes<ng binary files daily •  Collec<ng from ~500 tables across 6 databases à tens of millions of records daily •  Over 140 users to date in early pilo<ng •  Over 150 aeendees par<cipated in BDP training

•  Highlights: although early in the overall journey, the BDP is already demonstra<ng early benefits:

Key Metrics & Highlights

26

Development Engineer: demonstrated the joining of data sets for detailed logistics tracking—analyses that is very difficult to conduct with current systems

Ops Engineer: a recent production issue required detailed historical data. Current systems did not have the required retention for this data. However, the team was able to pull the data from the BDP in minutes, as opposed to 3+ weeks to pull the data from tape archive

Development Engineer: obtained technical data from the BDP in hours as opposed to 3+ weeks to pull from tape archive

DATA SEARCH PARTIES

YIELD

5/29/14

Governance – Project Prioritization

New Data Sets

Use Cases from

Strategy

New Apps /

Reports

New Analytics

Working Steering

Committee

Executive Steering

Committee

1.  New Data Set 1 2.  New Use Case 1 3.  Use Case from

Strategy

Prioritized List

Business Experts

Business impact vs. ease of delivery, etc.

Nominated by…

Prioritized by…

Core Platform Updates

Sponsored by…

27 5/29/14

Release Cycles

July Aug Sept Oct Nov Dec Oct Jan Sept Feb

Today

Mar Apr May

Big Data Plaoorm Roadmap Jun July Aug

Release 1b: all ac<ve

products

Release 1a: base plaoorm, first

product

Release 2: Field/RMA

Data

Quality Development Operations Core/Other

Release 3 Release 4 Release 5

Reduce Scrap

New Product Data Access

New Data Sets

Ongoing Support

Faster Response to Field Requests

Correlated Analysis

Improved Failure Analyses

Delivering new

capabilities

to the business every

45 – 90 days

28 5/29/14

Agenda

•  Why Big Data •  Technology CapabiliJes •  Organiza<onal Evolu<on •  Case Study •  Conclusions

5/29/14 29

Stages of Big Data Adoption

Contain Costs & Scale Big Data Reporting

Agile Analytics & Insight Innovation 360o view Speed Accuracy Data Science Hub

Business Innovation Optimization Engagement Data Science Automation

Organizational Transformation New products Customer analytics Operational efficiency New businesses Data Driven Organization

Sustained competitive advantage through Big Analytics

Traditional B.I.

5/29/14 30

Pitfalls

•  Data Poli<cs •  Immature Governance •  Siloed Organiza<on •  Siloed Applica<ons •  Buy-‐in •  Leaping over phases / Bypassing low hanging fruit •  Disrup<ve Change •  Insufficient funding •  Big Bang Rollout •  Skills Gap •  Tac<cal use cases/the big data plateau •  Business as usual

5/29/14 31

Conclusions

•  Big Data is affec<ng all companies especially in tech •  Start with low hanging fruit •  Establish a roadmap for incremental organiza<onal evolu<on and technical capabili<es

5/29/14 32