Ron Bodkin of Think Big Analytics talks about trends in the Big Data space
Big Data Beyond the Hype: Trends, Pitfalls and Building Competency
May 2014 (5/29/14)
Agenda
• Why Big Data
• Technology Capabilities
• Organizational Evolution
• Case Study
• Conclusions
Big Data
• Data sets so large and complex that they become awkward to work with using standard tools and techniques
• Google, Quantcast, Yahoo, LinkedIn… customer innovation in open source
Patterns from Delivering Big Data at Scale
eCommerce 2 of Global Top 5
Retail 2 of Global Top 5
Social Networking Global #1
Banking 4 of Global Top 10
Credit Issuer 2 of Global Top 5
Financial Data Services 2 of Global Top 5
Financial Exchanges Global #2
Brokerage & Mutual Funds 2 of Global Top 5
Asset Management Global #1
Semiconductor 2 of Global Top 5
Data Storage Devices 3 of Global Top 5
Disk Drive Manufacturing Global #1
Telecommunications 2 of Global Top 5
Media & Advertising 2 of Global Top 4
Internet Transaction Security Global #1
Manufacturing Analytics
1 Improve Yield - reduced scrap, faster time to market
2 Ease data access – quick search access to all relevant data for all history in one place, not samples, summary or recent data
3 Process Efficiency – identify redundant tests, optimize throughput
4 Component Analysis – proactive discovery of component problems, compatibility detection
5 Service Revenue Generation - Leverage customer support to generate new revenues by increasing customer retention, support renewals, and upgrading customers to new products and licenses.
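As a rough illustration of item 3 (Process Efficiency, identifying redundant tests), the sketch below flags pairs of tests whose pass/fail outcomes agree on every unit in a sample. The data layout, test names, and function name are hypothetical and not from the talk; a real analysis would likely also weigh parametric values and failure modes before dropping a test.

```python
from itertools import combinations


def redundant_test_pairs(results):
    """Return pairs of tests with identical pass/fail outcomes on every unit.

    `results` maps a test name to a list of booleans, one per unit
    (a hypothetical layout chosen for illustration).
    """
    pairs = []
    for a, b in combinations(sorted(results), 2):
        if results[a] == results[b]:
            pairs.append((a, b))
    return pairs


# Made-up sample: four units, three tests.
sample = {
    "voltage_check": [True, True, False, True],
    "current_check": [True, True, False, True],
    "thermal_check": [True, False, True, True],
}
print(redundant_test_pairs(sample))
```

Identical outcomes on a small sample only suggest redundancy; with all history in one place, the same check can be rerun over far more units before a test is retired.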
Product Analytics
1 Issue Resolution - Faster resolution of issues through deep analysis of integrated device log, product, and configuration data across the install base.
2 Support Metrics - Gain new insights into dimensions impacting service metrics by combining product use technical data with business data. E.g., number of tickets resolved, length of calls, resource allocation, customer satisfaction.
3 Proactive Service - Identify and address potential issues rather than waiting to react to them after they happen. I.e., intelligence for condition-based maintenance.
4 Customer Self-Service - Drive higher rates of customer self-service by providing customers access to tools and resources to answer their own questions and solve their own issues by analyzing past customer activity and integrating technical details from customer systems into self-service portals.
5 Service Revenue Generation - Leverage customer support to generate new revenues by increasing customer retention, support renewals, and upgrading customers to new products and licenses.
Clickstream Analytics
1 Timely Processing - Data volumes are increasing, clogging the data warehouse; results are slow and out of date.
2 Cost Efficiency & Scale - Data volumes are increasing, clogging the data warehouse; results are slow and out of date.
3 Flexible Tagging – Analyze the site without having to push data to tags; classify activity based on sections and URLs more flexibly than tags allow.
4 Converged Analytics - Understand and improve web, mobile, cross-channel customer activity. Siloed apps are costly and inflexible, e.g., tags require feeding data about user demographics and segments. Mash up interactions with data like geographic, demographic, social, call center, purchases.
5 Deeper Exploration - Analyze sequences of events in customer journeys. Deep drill-down (how do users with the latest iPhone and iOS compare to those with the latest Samsung and Android?). Cohort analysis - those who buy books vs. sporting goods.
6 Predictive Models – next best offer recommendations, personalization, customer value, microsegmentation, ad attribution…
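To make item 5 more concrete, here is a minimal sketch of analyzing event sequences per cohort: it counts how many users in each cohort reach each step of a funnel in order. The event tuple layout, the funnel steps, and the cohort labels are assumptions made for illustration, not details from the talk.

```python
from collections import defaultdict

# Hypothetical funnel; the steps must occur in this order to count.
FUNNEL = ("view", "add_to_cart", "purchase")


def funnel_by_cohort(events):
    """Count, per cohort, how many users reached each funnel step in order.

    `events` is an iterable of (user, cohort, timestamp, action) tuples
    (an assumed layout, for illustration only).
    """
    per_user = defaultdict(list)
    cohort_of = {}
    for user, cohort, ts, action in events:
        per_user[user].append((ts, action))
        cohort_of[user] = cohort

    counts = defaultdict(lambda: [0] * len(FUNNEL))
    for user, actions in per_user.items():
        row = counts[cohort_of[user]]  # ensures every cohort appears in the result
        step = 0
        for _, action in sorted(actions):  # replay the journey in time order
            if step < len(FUNNEL) and action == FUNNEL[step]:
                step += 1
        for i in range(step):
            row[i] += 1
    return {cohort: dict(zip(FUNNEL, row)) for cohort, row in counts.items()}


# Made-up clickstream for three users in two device cohorts.
events = [
    ("u1", "ios", 1, "view"), ("u1", "ios", 2, "add_to_cart"), ("u1", "ios", 3, "purchase"),
    ("u2", "ios", 1, "view"),
    ("u3", "android", 1, "view"), ("u3", "android", 2, "add_to_cart"),
]
result = funnel_by_cohort(events)
print(result)
```

At clickstream scale the same grouping and ordered replay would typically run as a distributed job rather than in memory; the logic per user stays the same.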
Agenda
• Why Big Data
• Technology Capabilities
• Organizational Evolution
• Case Study
• Conclusions
Big Data Reference Architecture
• Community open source
• Over $1 billion invested in startups; all major software companies embrace it
• 3 replicas for HA, 50 PB+ scale
Apache Hadoop
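Keeping 3 replicas means each block is stored three times, so raw disk has to be roughly three times the logical data size. The sketch below works through that arithmetic. It assumes the 50 PB figure refers to logical data, and the `headroom` parameter is an assumed planning margin, not a number from the talk.

```python
def raw_capacity_pb(logical_pb, replication=3, headroom=0.25):
    """Estimate the raw disk (in PB) needed to hold `logical_pb` of data.

    replication: copies kept of each block (3, per the slide).
    headroom: fraction of raw disk kept free for temporary and
              intermediate data (an assumed figure for illustration).
    """
    return logical_pb * replication / (1 - headroom)


# 50 PB of logical data at 3 replicas with 25% of raw disk kept free.
print(raw_capacity_pb(50))
# The same data with no headroom at all.
print(raw_capacity_pb(50, headroom=0.0))
```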
Hadoop cluster diagram:
• Client Servers: Hive, Pig, ...; cron+bash, Azkaban, …; Sqoop, Scribe, …; Monitoring, Management
• Primary Master Server: Job Tracker, Name Node
• Secondary Master Server: Secondary Name Node
• Slaves: multiple Slave Servers, each running a Task Tracker and a Data Node over eight disks
Copyright Think Big Analytics 2014. All rights reserved.
• Support for many processing engines including emerging Predictive Analytics libraries
• Common cluster with local data access in HDFS
YARN - Yet Another Resource Negotiator
• Focus on technologies for prescriptive analytics - getting insights from your data
• To get there you need
- data ingestion (Flume, Kafka, Sqoop, ...)
- data transformation (Pig, MapReduce, Hive, Cascading, …)
• Beyond prescriptive, Machine Learned predictive analytics is interesting
Narrowing In
• Scala-based, more general distributed processing with long-lived memory caching
• First released 2010, many contributors
• Scala, Java, Python APIs
• Largest production cluster to date is 80 nodes at Yahoo
• YARN support
• Subprojects
- Shark – Hive on Spark
- MLlib – machine learning
- Spark Streaming
- GraphX
• Pros: faster iterative analysis, growing distribution support
• Cons: less mature, Scala API foundation, smaller community
Apache Spark
• DAG processing of never-ending streams of data
- First released 2011
- Used at Twitter plus dozens of other companies
- Reliable - At Least Once semantics
- Think MapReduce for data streams
- Java / Clojure based
- Bolts in Java and ‘Shell Bolts’
- Not a queue, but usually reads from a queue
- YARN integration
• Pros: community, HW support, low latency, high throughput
• Cons: static topologies & cluster sizing, Nimbus – SPOF, state management
Apache Storm
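The "At Least Once" guarantee means a message that is not acknowledged gets delivered again, so downstream processing may see duplicates and should be idempotent. The toy below illustrates that idea in plain Python. It is not Storm's API: the `ReplayingSource` class, the `run` driver, and the `flaky_sink` function are hypothetical names invented for this sketch.

```python
from collections import deque


class ReplayingSource:
    """Toy source that re-emits any message that was not acknowledged."""

    def __init__(self, messages):
        self.queue = deque(enumerate(messages))  # (msg_id, payload) pairs
        self.pending = {}                        # emitted but not yet acked

    def next(self):
        if not self.queue:
            return None
        msg_id, payload = self.queue.popleft()
        self.pending[msg_id] = payload
        return msg_id, payload

    def ack(self, msg_id):
        self.pending.pop(msg_id, None)

    def fail(self, msg_id):
        # Put the message back on the queue so it is delivered again.
        self.queue.append((msg_id, self.pending.pop(msg_id)))


def run(source, process):
    """Drive the source until every message is acked; return the delivery count."""
    deliveries = 0
    while True:
        item = source.next()
        if item is None:
            return deliveries
        msg_id, payload = item
        deliveries += 1
        try:
            process(payload)
            source.ack(msg_id)
        except RuntimeError:
            source.fail(msg_id)


seen = set()   # idempotent sink: adding a duplicate does not change the result
attempts = {}


def flaky_sink(word):
    attempts[word] = attempts.get(word, 0) + 1
    seen.add(word)
    if word == "b" and attempts[word] == 1:
        # The side effect already happened, but the ack never arrives.
        raise RuntimeError("simulated failure after processing")


deliveries = run(ReplayingSource(["a", "b", "c"]), flaky_sink)
print(deliveries, sorted(seen), attempts)
```

Because the sink is a set, the replayed message does no harm. A sink that incremented a counter instead would double-count, which is one reason state management appears among the cons above.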
Platform Trends
• Hadoop will eat Big Data Analytics
• Hadoop YARN to run diverse applications
• Realtime latency: Storm, Samza, HBase, Cassandra, Elasticsearch…
• Management: Ambari, Pepperdata…
• Security: Knox, Sentry, XA Secure…
• Cloud: AWS, Qubole, Altiscale, Infochimps…
Analytics Trends
• Realtime query engines: Hive/Tez, Shark, Impala, Presto…
• Machine Learning: Spark, MLlib, Mahout, Oryx…
• Metadata: HCatalog
• Analyst Tools: Tableau, Alpine Data Labs, Platfora, Zoomdata…
• Sensemaking: Data Tamer, Trifacta, Paxata…
Agenda
• Why Big Data
• Technology Capabilities
• Organizational Evolution
• Case Study
• Conclusions
How To Get the Right Answers
Current State: Weak intersection between Data, Systems and Business often means project failure (weak area with diminished influence).
Improved State: Collaborative integration of Analytics with Big Data capabilities for success (cross-functional collaboration, data-driven business alignment).
Analytics and Data Science
Analytics is the umbrella term for discovery and communication of signal in data.
• Intelligence and Reporting: Governed insights for general audiences by providing summaries and visualizations over well understood data
• Descriptive Analytics: Interpretable data stories to domain consumers (e.g. business, sales, etc.) using exploratory data analysis and descriptive models (e.g. clustering) over well understood data
• Predictive Analytics: Predicts future state or assesses impact of changing parameters for domain consumers (e.g. business, sales, etc.) using statistical modeling over well understood data
• Data Science: Uncovers new or missing signals and relationships in data to be leveraged by other analytics. Supported with a mix of exploratory data analysis and statistical modeling over unexplored data or new combinations of known data. New capability driven by Big Data, currently overhyped and overused.
End goal: effective and repeatable integration of analytics into business processes.
Diagram axes: Analytic Integration (Low to High) vs. Big Data Capability (Limited to Full)
Stages shown: Basic Reporting, Big Data Reporting, Data Science Hub, Business Analytics, Data Driven Organization
Analytics Capability Model
Diagram: Data Science Hub and Data Driven Organization, plotted on Analytic Integration (Low to High) vs. Big Data Capability (Limited to Full)
• Redundant infrastructure
• Data locked down w/in product teams
• Analytic innovations not shared

• Share data internally
• Support joining of data
• Make analytics capabilities accessible to all business lines.
Client Example – Internet Infrastructure Firm
Agenda
• Why Big Data
• Technology Capabilities
• Organizational Evolution
• Case Study
• Conclusions
• “Reduce Data Search Parties”
• Stop playing “Where’s Waldo” with your data
• “I know I have that data… somewhere?”
• Data aggregation to a common platform with common access tools
• Improve yields by accessing more data in a more timely manner
• End-to-end visibility for every test, every diagnostic and all info from all components of a product (internal and external)
• Speed up yield improvement ramp-up on new products
• Improve steady-state yield on existing products.
Use Cases
Big Data Platform
Platform diagram:
• Data Sources: Devices, Components, Assemblies, …, Field Data, Supplier, ERP/DW’s (all raw parametric, logistic, vintage data)
• Platform: End-to-End Integrated Data; Batch Analytics (parallelized batch analytics); App-Specific Views; New High-Value Parameters; raw extracts; enriched data
• Consumers: Ad hoc Analysis, BI Tools, Optimize/Reduce Testing, Data Marts, Proactive Issue Identification, Failure Screen Tests, Customer FA via Field Data, Predictive Analytics Tools, …other…
Highlights
• Enterprise-wide data access for timely analytics and insights
• Foundation for large scale proactive analytics

Legacy vs. BDP:
• Retention. Legacy: 3-6 months in scattered DBs, then tape archive. BDP: all data online in common data lake for 3+ years.
• Coverage. Legacy: summaries, samples for several data sets. BDP: all parametric data captured in raw and integrated form.
• Analysis. Legacy: reactive, resulting in missed improvement opportunities. BDP: in daily operations and on larger data sets to support proactive improvements.
• Metrics:
- Collecting >2M manufacturing/testing binary files daily
- Collecting from ~500 tables across 6 databases → tens of millions of records daily
- Over 140 users to date in early piloting
- Over 150 attendees participated in BDP training
• Highlights: although early in the overall journey, the BDP is already demonstrating early benefits:
Key Metrics & Highlights
Development Engineer: demonstrated the joining of data sets for detailed logistics tracking, an analysis that is very difficult to conduct with current systems
Ops Engineer: a recent production issue required detailed historical data. Current systems did not have the required retention for this data. However, the team was able to pull the data from the BDP in minutes, as opposed to 3+ weeks to pull the data from tape archive
Development Engineer: obtained technical data from the BDP in hours as opposed to 3+ weeks to pull from tape archive
DATA SEARCH PARTIES
YIELD
Governance – Project Prioritization
Nominated by… Business Experts: New Data Sets; Use Cases from Strategy; New Apps / Reports; New Analytics; Core Platform Updates
Prioritized by… Working Steering Committee: business impact vs. ease of delivery, etc.
Sponsored by… Executive Steering Committee
Prioritized List:
1. New Data Set 1
2. New Use Case 1
3. Use Case from Strategy
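One way to turn "business impact vs. ease of delivery" into a prioritized list is a simple weighted score. The sketch below assumes 1 to 5 scales and a 60/40 weighting; the `Candidate` class, the scores, and the weights are hypothetical. Only the candidate names come from the slide.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    business_impact: int   # 1 (low) to 5 (high); assumed scale
    ease_of_delivery: int  # 1 (hard) to 5 (easy); assumed scale


def prioritize(candidates, impact_weight=0.6):
    """Sort candidates by weighted score, highest first."""
    def score(c):
        return (impact_weight * c.business_impact
                + (1 - impact_weight) * c.ease_of_delivery)
    return sorted(candidates, key=score, reverse=True)


# Scores below are invented for illustration.
backlog = [
    Candidate("Use Case from Strategy", 3, 3),
    Candidate("New Use Case 1", 5, 2),
    Candidate("New Data Set 1", 4, 5),
]
ranked = [c.name for c in prioritize(backlog)]
print(ranked)
```

A steering committee would normally adjust such a ranking by judgment; the score is a starting point for the discussion rather than a decision rule.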
Release Cycles
Big Data Platform Roadmap (monthly timeline starting in July and running through August of the following year, with a "Today" marker)
• Release 1a: base platform, first product
• Release 1b: all active products
• Release 2: Field/RMA Data
• Release 3, Release 4, Release 5
Tracks: Quality, Development, Operations, Core/Other
Capabilities shown: Reduce Scrap; New Product Data Access; New Data Sets; Ongoing Support; Faster Response to Field Requests; Correlated Analysis; Improved Failure Analyses
Delivering new capabilities to the business every 45 – 90 days
Agenda
• Why Big Data
• Technology Capabilities
• Organizational Evolution
• Case Study
• Conclusions
Stages of Big Data Adoption
Contain Costs & Scale Big Data Reporting
Agile Analytics & Insight Innovation 360° view Speed Accuracy Data Science Hub
Business Innovation Optimization Engagement Data Science Automation
Organizational Transformation New products Customer analytics Operational efficiency New businesses Data Driven Organization
Sustained competitive advantage through Big Analytics
Traditional B.I.
Pitfalls
• Data Politics
• Immature Governance
• Siloed Organization
• Siloed Applications
• Buy-in
• Leaping over phases / Bypassing low hanging fruit
• Disruptive Change
• Insufficient funding
• Big Bang Rollout
• Skills Gap
• Tactical use cases/the big data plateau
• Business as usual
Conclusions
• Big Data is affecting all companies, especially in tech
• Start with low hanging fruit
• Establish a roadmap for incremental organizational evolution and technical capabilities