hadoop-summit
Evolution of Big Data at Intel - crawl, walk and run approach
Gomathy Bala | Director
Chandhu Yalla | Manager & Architect
Key Contributors: Sonja Sandeen, Seshu Edala, Nghia Ngo and Darin Watson
IT BI Big Data Team
Copyright © 2014, Intel Corporation. All rights reserved.
Legal Notices
This presentation is for informational purposes only. INTEL MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.
The content in this presentation is being shared under NDA.
Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
* Other names and brands may be claimed as the property of others.
Agenda
• Intel IT Big Data Journey
• Enterprise DW architecture
• BI Big Data 3 yr Roadmap
• Big Data Ecosystem Architecture
• Platform Strategies & BKMs
• Summary
Intel IT Big Data Journey
Timeline: 2011, 2012, 2013, 2014, 2015
Milestones:
• Big Data & Analytics Strategy
• Production Online
• Telmap: 1st Use Case
• Preproduction Online
• Hadoop Evaluation
• IDH to CDH
• Hadoop 2.0
• $176M BV
• Production: Security BI, Attribute Reduction System, ATM Ellipses Engine, IAH-Retail Analytics
• 6 Environments
• CDH 5.3
• 4 Use Cases in Preproduction
• 12 POC Use Cases
• 6 Use Cases in Production
• $290K investment, $948/TB
• 3 Use Cases in Production: Smart-What, Marketing-IAH, Incident Predictability
• $6M BV
• CDH 5.1
• IAH – Cloud CRM In Production
• Enterprise Standards, Guidance, Processes for Platform & Capabilities
Summary: 15 Active Use Cases | $290K + 10.5 HC Investment | Delivered $182M BV
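The journey figures can be cross-checked with a little arithmetic. The sketch below sums the two business-value (BV) figures quoted on the slide and compares them with the $182M total. It also derives a storage capacity from the investment and the $/TB figure, under the assumption (not stated on the slide) that $948/TB was computed as the $290K investment divided by the capacity purchased.

```python
# Figures quoted on the journey slide.
business_value_musd = [6, 176]   # "$6M BV" and "$176M BV", in millions of USD
investment_usd = 290_000         # "$290K investment"
cost_per_tb_usd = 948            # "$948/TB"

# The two BV figures should add up to the "Delivered $182M BV" summary.
total_bv_musd = sum(business_value_musd)

# Assumption: $/TB = investment / capacity, so capacity = investment / ($/TB).
# The slide does not say how the $/TB figure was derived.
implied_capacity_tb = investment_usd / cost_per_tb_usd

print(f"Total business value: ${total_bv_musd}M")
print(f"Implied capacity: {implied_capacity_tb:.0f} TB")
```

The sum reproduces the $182M on the summary line; the implied capacity is only as good as the stated assumption.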
Big Data & Analytics Really Delivers!
From 2014 – 2015 Intel IT Business Review – Annual Edition
Kim's Video
2014-2017 Vision: Real-Time Enterprise
[Diagram labels: Any Data Source (ERP, CRM, SCM, SRM, ECC, BWECCW, MDG, NW, BPC, BCS, Custom, Intel, Other Apps, SaaS, New Apps); In Memory Real-Time Data Platform; Real-Time & Self Service Analytics Platform; Enterprise Data Warehouse (Teradata, Cloudera Hadoop Data Lake; Data Tiering: Hot-Cold data); Reporting Tools; NRT; Predictive Analytics; Cloud BI; Downstream Applications]
Enterprise Data Architecture with Hadoop and Other MPP DWH: Current & Future Strategy
FE Tools
CLS/Proxy
High speed data loader
Big Data
Use Cases
• Machine Learning
• Log Processing
• Unstructured data
• High volume counter Analytics
• Text Parsing/Mining
EDW
Mfg Data
Use Cases
• Strategic/Operational reporting
• Interactive Reporting
• High Concurrent user analytics - Supply/Order
• Mission critical analytics – Finance/HR
Present
Future
SQL on Hadoop
A %ge of Traditional BI use cases
IMT
BI Big Data | 3-Year Roadmap
2015: Big Data + AA
2016: Big Data + SSAA + Traditional BI
2017: Big Data + SSAA + Traditional BI
Roadmap capabilities:
• Scalable and well designed Hadoop Platform
• Security (RBAC, ITS/IRS)
• Data Governance
• Data Discovery
• Self Service AA Framework
• IMT + Hadoop
• AVP + Hadoop
• In-memory + Near real time capabilities
• SQL on Hadoop
• Evolve IMT + Hadoop
• Data Lineage & Data Catalog
• Streaming Capabilities
• Advanced SQL on Hadoop
• ACID semantics
• Evolve Big Data + SSAA per ecosystem roadmaps
• BC/DR
• End to end enterprise features
• Enterprise ready: OLAP and Traditional DW
Hadoop is an open source framework designed for big data analytics.
Hadoop is evolving rapidly, but it will still take a couple of years for it to mature and support “traditional BI” use cases.
Legend: Orange Text: Traditional BI Capabilities | Green Text: Big Data/AA Capabilities
Big Data Platform – Ecosystem Architecture & Maturity
User layer: End User, Data Steward, Business Analyst, Data Scientist, Developer, Auditor
Presentation Layer: Data Virtualization, Data Discovery, Adv. Analytics, Adv. Visualization, Data Management
Analytical layer: Machine Learning, Statistical, Numerical, Time series, Textual/Log, Spatial, Graph
Processing Layer: Batch Processing, NRT/Stream Processing, In-Memory Processing
Storage Model: Textual/Log DB, Hierarchy DB, Relational DB, Graph DB, Columnar DB
Middleware: Continuous Integration, Dev Framework, Security, Source/Target APIs, 3rd Party Drivers, Ent. Scheduler Srvs, Metadata Mgmt, Workload Mgmt
Infrastructure: Platform Virtualization, Platform Management, Network Management, Systems Management
Data Integration: Data Ingestion, Data Egression, Data Consumption
Processes: Prescriptive Guidance, Change Release, Governance, Engagement, Service Management, Training, Support
Legend: Majority CDH offered capabilities | Other Vendors offered capabilities
BI Big Data Platform
Hadoop Project Sandbox – CDH 5.3
Multiple instances deployed on Intel Cloud & MyCloud environments. TTM to business: 2-3 days
Hadoop Pre-Production – CDH 5.3
10 data nodes | 399TB | 320 vcores
Use cases in Dev/POC: 14
Hadoop Production – CDH 5.3
22 data nodes | 658TB | 704 vcores
Use cases live in prod: 7
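The cluster figures above can be reduced to per-node numbers, which is useful when planning capacity additions one data node at a time. The following is a minimal sketch: the `Cluster` class and its field names are illustrative, and it assumes storage and vcores are spread evenly across data nodes. The slide does not say whether the TB figures are raw or usable capacity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    """Illustrative container for the per-environment figures on the slide."""
    name: str
    data_nodes: int
    storage_tb: float
    vcores: int

    @property
    def tb_per_node(self) -> float:
        # Assumes storage is spread evenly across data nodes.
        return self.storage_tb / self.data_nodes

    @property
    def vcores_per_node(self) -> float:
        # Assumes every data node exposes the same number of vcores.
        return self.vcores / self.data_nodes


clusters = [
    Cluster("Pre-Production", data_nodes=10, storage_tb=399, vcores=320),
    Cluster("Production", data_nodes=22, storage_tb=658, vcores=704),
]

for c in clusters:
    print(f"{c.name}: {c.tb_per_node:.1f} TB/node, "
          f"{c.vcores_per_node:.0f} vcores/node")
```

Both environments come out at 32 vcores per data node, while storage per node differs between them.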
• Hadoop 2.0 architecture provides reliability, scalability & performance
• High availability and scalability design
• Well positioned to meet 2015 business use case requirements
• Repeatable architecture for faster builds
• Capacity additions: Add data node. White boxes, Waterfall equipment or HP servers
• TTM: Varies depending on HW (3 wks-2 months)
• Scale to meet business needs
Gateway nodes: Login (ssh): AD authentication & authorization, access cluster, run HDFS commands, submit jobs, etc.
[Cluster diagram labels: Job/Workflow Management; five Data Nodes; two Name Node/Resource Mgr nodes (NN hi-av); YARN; heartbeat, balancing, replication; Gateway Nodes; Management Node; Source Data (DB data, unstructured, semi-structured); Data Movement/ETL; EDW or Datamart (DB data); Visualization Tools]
BKMs | Summary
• Plan for skills and resources, with time to ramp up.
• Starting small is OK. Focus on design and scalability for the platform.
• Technical product evaluation: stick with a distribution built on the core open source Hadoop stack rather than proprietary software.
• Security is a big deal to Intel; implementing Big Data security capabilities is a key focus.
• To understand the data, use an iterative discovery method with technical, business and modeling teams.
• The Intel IT Big Data journey benefited heavily from the Cloudera partnership.
• Open source will play a big role in advancing Big Data capabilities and analytics.
BI Big Data IT@Intel Resource Info
BI Big Data IT@Intel Resource Links:
1. Hadoop Migration Success Story: How Intel IT Moved to Cloudera
2. Mining Big Data in the Enterprise for Better Business Intelligence
3. Enabling Big Data Platforms and Solutions with Centralized Data Management
4. Integrating Apache Hadoop* into Intel’s Big Data Environment
5. Using a Multiple Data Warehouse Strategy to Improve BI Analytics
To learn more: www.intel.com/bigdata
Q & A
Intel Confidential — Do Not Forward
Backup
Big Data Capability Catalog
Legend: Enabled (Avail. Now) | WIP (1-3 Months) | Planned (3-6+ Months)
Cloudera* Distribution of Hadoop (CDH)
Components: Hive, HDFS, MapReduce, ZooKeeper, Pig, Mahout, HDFS Compress, Whirr, HBase, Storm, HCatalog, Accumulo, YARN, Spark, Hue, Solr, Impala, Parquet, DataFu, Sentry
3rd Party SW/Connectors Integration and Data Integration: SQOOP JDBC, Other DW, Autosys, SecureGIT, Impala JDBC, Hive ODBC, Impala ODBC, TDCH, Oozie, Kafka, Sqoop, DI Gateway, Flume, SFTP, SMB Client, Camel, Google Analytics, SFDC
Cloudera Manager* (System Management): Deployment, Monitoring, Reporting, Diagnostics, Alerting, Service Management, Rolling Upgrades, Config Rollbacks
Cloudera Navigator* (Data Management): Audit, Access Control, Discovery, Explore, Lineage, Lifecycle
Infrastructure: Network, Servers, Storage, Security, OS, Hi-Av, EAM / AD Integration
Process: Governance, Change Release, Engagement, Service mgmt., Prescriptive Guidance, Training
List includes only the capabilities planned for next 6 months.
Migration to Cloudera – 6 BKMs
i. Find Differences with a Comparative Evaluation in a Sandbox Environment
ii. Define Your Strategy for the Cloudera Implementation
iii. Split the Hardware Environment
iv. Upgrade the Hadoop Version
v. Create a Preproduction-to-Production Pipeline
vi. Rebalance the Data
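The last BKM, rebalancing the data, can be illustrated with the criterion the HDFS balancer uses: a cluster counts as balanced when every DataNode's utilization is within a threshold (10% by default) of the cluster-wide utilization. The sketch below only models that check. The node names and sizes are hypothetical, and the slide does not describe how Intel actually ran the rebalance; on a real cluster the work is done by the `hdfs balancer` tool.

```python
def cluster_utilization(nodes):
    """Percent of total capacity in use across all nodes.

    nodes maps a node name to a (used_tb, capacity_tb) tuple.
    """
    used = sum(u for u, _ in nodes.values())
    capacity = sum(c for _, c in nodes.values())
    return 100.0 * used / capacity


def unbalanced_nodes(nodes, threshold_pct=10.0):
    """Return nodes whose utilization deviates from the cluster-wide
    utilization by more than threshold_pct, with the signed deviation."""
    average = cluster_utilization(nodes)
    result = {}
    for name, (used, capacity) in nodes.items():
        deviation = 100.0 * used / capacity - average
        if abs(deviation) > threshold_pct:
            result[name] = round(deviation, 1)
    return result


# Hypothetical cluster: three full-ish nodes plus one newly added, empty node.
nodes = {
    "dn1": (24.0, 30.0),
    "dn2": (24.0, 30.0),
    "dn3": (24.0, 30.0),
    "dn4-new": (0.0, 30.0),
}
print(unbalanced_nodes(nodes))
```

Adding an empty node pulls the cluster-wide utilization down, so in this example the existing nodes are flagged as over-utilized as well as the new node being under-utilized. That is why a rebalance is worth scheduling after hardware is split or added.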
Building Block Strategy to Enterprise Security of Hadoop
Q1’15: Perimeter access with LDAP, plus finer-grained controls with Sentry. This is the second building block towards an enterprise-grade security design.
Q2’15: Add Kerberos to enable more Hadoop components and further secure the platform.
2H’15: Exploration is starting; awaiting the product, with a target to adopt it in Production in 2H’15.
Now
Q2’15
2H’15
Hadoop Maturity & Evolution
Timeline: 2005 - 2012 | 2013 - 2014 | 2015 - 2017

Hadoop 1.0
Stack: MapReduce (batch data processing, cluster resource management); HDFS 1.0 (redundant, reliable data storage)
+ Scalable data storage and processing platform
+ Positioned for Batch processing workloads for Map and Reduce only
+ Apache Hive offers SQL like query language
- Lacks reliability and stability
- No support for low latency queries

Hadoop 2.0
Stack: Batch (MapReduce), Others (data processing); YARN (cluster resource management); HDFS (redundant, reliable data storage)

Applications Run Natively In Hadoop
Stack: Interactive (Impala), In-Memory (Spark), Batch (MapReduce), Online (HBase), Graph, Others (Search, Storm etc.); YARN (cluster resource management); HDFS 2.0 (redundant, reliable data storage)
Apache YARN allows you to run multiple applications in Hadoop and provides reliability, scalability and performance
• Advanced Resource Management
• Apache Hive offers a 50x improvement in performance for queries
• Cloudera Impala to support low latency query requirements with SQL-92 and SQL-2000 support
• Data at Rest Encryption and Row Level/Cell Level Security planned
• Data Streaming and Search Capability
• GraphDB
• Expanded Data Governance
• IMT + Hadoop Integration
• Improved Front End tool integration/support
• Deeper Diagnostics for multiple components
2014 Intel IT Vital Statistics
>6,300 IT employees
59 global IT sites
>98,000 Intel employees [1]
168 Intel sites in 65 countries
64 data centers (91 data centers in 2010)
80% of servers virtualized (42% virtualized in 2010, goal of 75%)
>147,000 devices
100% of laptops encrypted
100% of laptops with SSDs
>43,200 handheld devices
57 mobile applications developed
Source: Information provided by Intel IT as of Jan 2014
[1] Total employee count does not include wholly owned subsidiaries that Intel IT does not directly support
Big Data in the Industry
Recommendation Engine
Fraud Detection
Sentiment Analytics
Behavioral Targeting
Customer Experience Analytics
Marketing campaign Analytics
Learn more about Intel IT’s Initiatives at www.intel.com/IT
Sharing Intel IT Best Practices With the World