Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Realtime Analytics in Hadoop
Rommel Garcia – Solution Engineer
October 10, 2014
Page 3
Hadoop provides
• Terabytes to petabytes of storage on commodity hardware (HDFS)
• Massively parallel computation on enormous amounts of data (YARN)
Hadoop is essentially a supercomputer for the masses!
Page 4
HDFS: Scalable, Reliable, Secure Storage Platform
HDFS (Hadoop Distributed File System)
YARN: Data Operating System
Reliable
• Highly available & fault tolerant
• Protects against data loss & corruption
Cost Effective
• Horizontally scales on commodity hardware
Secure
• Strong access controls, integrated with authentication mechanisms
• Granular data access controls to datasets across users and groups
Standards-Based Data Interfaces
• NFS (source/destination)
• REST (source/destination)
• RPC (source/destination)
The Storage Platform for the Modern Data Architecture
Ingest and store any data in any format
Flexible read access enables a variety of workloads
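The REST interface on this slide is exposed in Hadoop as WebHDFS, which the deck names on a later slide. As a minimal sketch, the URL for one file operation could be assembled as below. The host, port, file path and user are hypothetical placeholders; only the `/webhdfs/v1` prefix and the `op` and `user.name` query parameters come from WebHDFS itself.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, user=None):
    """Build a WebHDFS REST URL for a single file system operation."""
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    params = {"op": op}
    if user is not None:
        params["user.name"] = user
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{urlencode(params)}"


# Hypothetical NameNode address and file path, for illustration only.
url = webhdfs_url("namenode.example.com", 50070,
                  "/data/trucks/events.csv", "OPEN", user="analyst")
print(url)
```

An HTTP client would then issue a GET against this URL to read the file; that step is left out so the sketch runs without a cluster.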
Page 5
Hadoop 1
Single Use Data Platform
Batch
HADOOP 1
• Redundant, Reliable Storage (HDFS)
• MapReduce
• Hive, Pig, Java
Page 6
2006: Hadoop w/ MapReduce
• MapReduce on HDFS (Hadoop Distributed File System), nodes 1 … N
• Largely batch processing
• Silo’d clusters, largely batch system, difficult to integrate
2009: MR-279: YARN
October 23, 2013: Hadoop 2 & YARN
• Hadoop 2 & YARN based architecture
• YARN: Data Operating System on HDFS (Hadoop Distributed File System), nodes 1 … N
• Batch, Interactive, Real-Time
• Enabled the Modern Data Architecture
Page 7
Hadoop
Multi Use Data Platform
Batch, Interactive, Realtime, Online, Streaming, …
HADOOP 2
• Redundant, Reliable Storage (HDFS)
• Efficient Cluster Resource Management & Shared Services (YARN)
• Standard Query Processing: Hive
• Batch: MapReduce
• Interactive: Tez
• Online Data Processing
• Real Time Stream Processing
• Others
Page 9
Traditional systems under pressure
• Silos of Data
• Costly to Scale
• Constrained Schemas
…and difficult to manage new data
APPLICATIONS
• Business Analytics
• Custom Applications
• Packaged Applications
DATA SYSTEM
• RDBMS, EDW, MPP
SOURCES
• Existing Sources (CRM, ERP, …)
• New Data Types: Clickstream; Geolocation; Sentiment, Web Data; Sensor, Machine Data (IoT); Unstructured docs, emails; Server logs
Page 10
Hadoop 2 and YARN enable the Modern Data Architecture
Common data set, multiple applications
• Optionally land all data in a single cluster
• Batch, interactive & real-time use cases
• Support multi-tenant access, processing & segmentation of data
YARN: Architectural center of Hadoop
• Consistent security, governance & operations
• Ecosystem applications run natively in Hadoop
SOURCES
• EXISTING Systems
• Clickstream
• Web & Social
• Geolocation
• Sensor & Machine
• Server Logs
• Unstructured
APPLICATIONS
• Business Analytics
• Custom Applications
• Packaged Applications
DATA SYSTEM
• RDBMS, EDW, MPP
• YARN: Data Operating System on HDFS (Hadoop Distributed File System), nodes 1 … N
• Batch, Interactive, Real-Time
Page 12
Realtime Analytics in…
Financial Services
• Fraud Detection/Prevention
Retail
• Brand Sentiment Analysis
• Localized, Personalized Promotions
Telecom
• Cell tower diagnostics
• Bandwidth Allocation
Manufacturing
• Proactive Maintenance
Healthcare
• Monitor patient vitals
• Patient care and safety
• Reduce re-admittance rates
Utilities, Oil & Gas
• Smart meter stream analysis
• Proactive equipment repair
• Power and consumption matching
Public Sector
• Network intrusion detection and prevention
• Disease outbreak detection
Transportation
• Unsafe driving detection and monitoring
Page 13
Truck Demo: Real-Time Analytics
Problem:
• The only way to measure “safe driving” is through accident occurrences.
• There’s no realtime accident prevention mechanism in place.
Solution:
• Use Hadoop to analyze driving violations in real-time
• Provide a UI to view real-time violation alerts
• Provide a dashboard to review violation reports
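To make "analyze driving violations in real-time" concrete, here is a minimal in-process sketch of one possible rule: alert when a driver accumulates more than a set number of violations inside a sliding time window. The event fields (`driver_id`, `ts`, `event_type`), the thresholds and the class name are all assumptions for illustration. The demo itself runs on Kafka and Storm, which this sketch does not use.

```python
from collections import defaultdict, deque


class ViolationMonitor:
    """Alert when a driver exceeds `max_violations` within `window_s` seconds."""

    def __init__(self, max_violations=3, window_s=60):
        self.max_violations = max_violations
        self.window_s = window_s
        # driver_id -> timestamps of that driver's recent violations
        self._recent = defaultdict(deque)

    def process(self, event):
        """Consume one event dict; return an alert dict, or None."""
        if event["event_type"] == "normal":
            return None
        times = self._recent[event["driver_id"]]
        times.append(event["ts"])
        # Drop violations that have fallen out of the sliding window.
        while times and event["ts"] - times[0] > self.window_s:
            times.popleft()
        if len(times) > self.max_violations:
            return {"driver_id": event["driver_id"],
                    "count": len(times),
                    "last_event": event["event_type"]}
        return None


# Hypothetical event stream; timestamps are seconds.
monitor = ViolationMonitor(max_violations=2, window_s=60)
stream = [
    {"driver_id": 7, "ts": 0, "event_type": "speeding"},
    {"driver_id": 7, "ts": 10, "event_type": "normal"},
    {"driver_id": 7, "ts": 20, "event_type": "lane_departure"},
    {"driver_id": 7, "ts": 30, "event_type": "speeding"},
]
alerts = [a for a in (monitor.process(e) for e in stream) if a is not None]
print(alerts)
```

Keeping only per-driver timestamps keeps the state small, which matters when the same logic is spread across many parallel workers.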
Page 15
Truck Demo Real-Time Hadoop Architecture
Components:
• Truck Events
• Kafka: High Speed Ingestion
• Storm: Distributed Processing
• HBase
• HDFS/Hive
• Message Queue (ActiveMQ)
• Real-Time Monitoring App
• Solr (Reporting Dashboard)
Data flows: Truck Event Data, Violations, Alerts
Actions: Show, Show Driving Report
Page 17
Hadoop 2.0
Rommel Garcia – Solution Engineer
October 10, 2014
Page 19
Hadoop 2 delivers a comprehensive data management platform
Hadoop 2 Platform
GOVERNANCE & INTEGRATION
• Data Workflow, Lifecycle & Governance: Falcon, Sqoop, Flume, NFS, WebHDFS
BATCH, INTERACTIVE & REAL-TIME DATA ACCESS
• Script: Pig
• SQL: Hive, HCatalog
• NoSQL: HBase, Accumulo
• Stream: Storm
• Search: Solr
• In-Memory: Spark
• Others: ISV Engines
DATA MANAGEMENT
• YARN: Data Operating System
• HDFS (Hadoop Distributed File System), nodes 1 … N
SECURITY
• Authentication, Authorization, Accounting, Data Protection
• Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox
OPERATIONS
• Provision, Manage & Monitor: Ambari, Zookeeper
• Scheduling: Oozie
Deployment Choice
• Linux, Windows, On-Premise, Cloud
YARN is the architectural center of Hadoop 2
• Enables batch, interactive and real-time workloads
• Single SQL engine for both batch and interactive
• Enable existing ISV apps to plug directly into Hadoop via YARN
Provides comprehensive enterprise capabilities
• Governance
• Security
• Operations
The widest range of deployment options
• Linux & Windows
• On premise & cloud
Tez
Page 21
YARN Development Framework
Layers: System, Engine, API
YARN: Data Operating System on HDFS (Hadoop Distributed File System), nodes 1 … N
• Batch: MapReduce
• Real-Time: Slider
• Direct: Java, .NET
• Scripting: Pig
• SQL: Hive
• Cascading: Java, Scala
• NoSQL: HBase, Accumulo
• Stream: Storm
• Others: Spark, Other ISV
• Applications: Other ISV
• Tez
Several components are marked “New” in the diagram.
Page 22
YARN General Store – The Future
• A Data Lake that has a General Store to continually serve you:
– App Store – YARN Ready Applications
– Data Store – Where do I get the interesting data (Weather, Geo, etc.)?
– View Store – How do I get UIs to the cluster?
– Processing Store – Falcon, Pig, etc. for “standard” data sets or common “processing patterns”
Page 24
Argus: Security needs are changing
Security needs are changing
• YARN unlocks the data lake
• Multi-tenant: Multiple applications for data access
• Changing and complex compliance environment
• ETL of non-sensitive data can yield sensitive data
Fall 2013: Largely silo’d deployments with single workload clusters
Summer 2014: 65% of clusters host multiple workloads
5 areas of security focus
• Administration: Central management & consistent security
• Authentication: Authenticate users and systems
• Authorization: Provision access to data
• Audit: Maintain a record of data access
• Data Protection: Protect data at rest and in motion
Page 25
Security in Hadoop with HDP + Argus (XA Secure)
Centralized Security Administration
Authentication: Who am I/prove it?
• Hadoop 2: Kerberos in native Apache Hadoop; HTTP/REST API secured with Apache Knox Gateway
• Argus: As-is, works with current authentication methods
Authorization: Restrict access to explicit data
• Hadoop 2: HDFS Permissions, HDFS ACL, Hive ATZ-NG
• Argus: HDFS, Hive and HBase; fine grain access control; RBAC
Audit: Understand who did what
• Hadoop 2: Audit logs in HDFS & MR
• Argus: Centralized audit reporting; policy and access history
Data Protection: Encrypt data at rest & in motion
• Hadoop 2: Wire encryption in Hadoop; open source initiatives; partner solutions
• Argus: Future integration
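The pairing of RBAC with centralized audit on this slide can be illustrated with a small sketch: every access decision goes through one policy object, which also records the decision. This is a generic model under assumed names (`PolicyEngine`, `role_grants`, `user_roles`); it is not the Argus or XA Secure API.

```python
from dataclasses import dataclass, field


@dataclass
class PolicyEngine:
    """Toy role-based access check that audits every decision."""
    role_grants: dict  # role -> set of (resource, action) pairs
    user_roles: dict   # user -> set of role names
    audit_log: list = field(default_factory=list)

    def is_allowed(self, user, resource, action):
        roles = self.user_roles.get(user, set())
        allowed = any((resource, action) in self.role_grants.get(role, set())
                      for role in roles)
        # Record denied requests too, so the log shows who attempted what.
        self.audit_log.append({"user": user, "resource": resource,
                               "action": action, "allowed": allowed})
        return allowed


# Hypothetical roles, users and resources.
engine = PolicyEngine(
    role_grants={"analyst": {("hive:sales", "select")}},
    user_roles={"ana": {"analyst"}},
)
print(engine.is_allowed("ana", "hive:sales", "select"))
print(engine.is_allowed("ana", "hive:sales", "drop"))
print(len(engine.audit_log))
```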
Page 27
Hive: The De-Facto SQL Interface for Hadoop
Page 28
Data Abstractions in Hive
Partitions, buckets and skews facilitate faster, more direct data access. Cube, windowing, and aggregation functions are supported as well.
Hierarchy: Database → Table → Partition → Bucket
Optional Per Table
Skewed Keys, Unskewed Keys
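Why partitions and buckets give "more direct data access" is easier to see in code. Hive stores each partition as a `column=value` directory and assigns a row to a bucket by hashing the bucketing column and taking the remainder modulo the bucket count. The sketch below models both steps. The table path and column names are hypothetical, and the string hash is a stand-in (CRC32), not Hive's own hash function.

```python
import zlib


def partition_dir(table_dir, partition_values):
    """Return the directory of one partition, e.g. .../year=2014/month=10."""
    parts = [f"{col}={val}" for col, val in partition_values]
    return "/".join([table_dir.rstrip("/")] + parts)


def bucket_index(key, num_buckets):
    """Map a bucketing-column value to a bucket number in [0, num_buckets)."""
    if isinstance(key, int):
        h = key  # integers are used as their own hash in this sketch
    else:
        h = zlib.crc32(str(key).encode("utf-8"))  # stand-in, not Hive's hash
    return (h & 0x7FFFFFFF) % num_buckets


# A query filtered on year and month only needs to read this directory.
print(partition_dir("/warehouse/events", [("year", 2014), ("month", 10)]))
# A lookup on driver_id 42 only needs to read one of the 8 bucket files.
print(bucket_index(42, 8))
```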
Page 33
Data Pipeline Tracing
Data pipeline dependencies
• View dependencies between clusters, datasets and processes
• Feeds shown: Purchase feed, Customer feed, Product feed, Store feed
Data pipeline tagging
• Add arbitrary tags to feeds & processes
• Example: Credit feed tagged “Sensitive”, “encrypted”
Data pipeline audits
• Know who modified a dataset when and into what
Data pipeline lineage
• Analyze how a dataset reached a particular state
• Files shown: File-1, File-2, File-3
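Lineage ("how a dataset reached a particular state") amounts to walking a dependency graph backwards. A minimal sketch follows. It reuses the slide's File-1, File-2 and File-3 labels as node names, but the edges between them and the `upstream` helper are assumptions for illustration, not Falcon's API.

```python
from collections import deque


def upstream(lineage, target):
    """Return every dataset `target` was derived from, nearest first."""
    seen, order = set(), []
    queue = deque(lineage.get(target, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        queue.extend(lineage.get(node, []))
    return order


# dataset -> the datasets it was directly derived from (hypothetical edges)
lineage = {
    "File-3": ["File-2"],
    "File-2": ["File-1"],
}
print(upstream(lineage, "File-3"))
```

The same traversal run over the forward edges would answer the dependency question instead: which downstream datasets are affected if a feed changes.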
Page 34
Example: Multi-Cluster Replication
Clusters: Primary Hadoop Cluster, Failover Hadoop Cluster (linked by Replication)
Datasets shown: Raw Data, Cleansed Data, Conformed Data, Presented Data; Staged Data, Presented Data
BI and Analytic Applications
• Falcon manages workflow and replication
• Enables business continuity without requiring full data reprocessing
• Failover clusters can be smaller than primary clusters
..and many more
Page 35
Example: Retention
Retention Policy
• Staged Data: Retain 5 Years
• Cleansed Data: Retain 3 Years
• Conformed Data: Retain 3 Years
• Presented Data: Retain Last Copy Only
• Sophisticated retention policies expressed in one place
• Simplify data retention for audit, compliance, or for data re-processing
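"Retention policies expressed in one place" can be sketched as a single table of rules plus one function that applies them. The stage-to-period pairing below follows the order in which the slide lists the stages and periods, which is a reading of the slide rather than a confirmed mapping. The names (`POLICY`, `expired`) are hypothetical, and a year is approximated as 365 days. This is not Falcon's configuration format.

```python
from datetime import date, timedelta

LAST_COPY_ONLY = "last-copy-only"

# One place for all rules: stage -> maximum age, or the last-copy-only marker.
POLICY = {
    "staged": timedelta(days=5 * 365),
    "cleansed": timedelta(days=3 * 365),
    "conformed": timedelta(days=3 * 365),
    "presented": LAST_COPY_ONLY,
}


def expired(stage, instance_dates, today):
    """Return the instance dates of `stage` that the policy no longer retains."""
    rule = POLICY[stage]
    dates = sorted(instance_dates)
    if rule == LAST_COPY_ONLY:
        return dates[:-1]  # everything except the newest instance
    return [d for d in dates if today - d > rule]


today = date(2014, 10, 10)
print(expired("staged", [date(2008, 1, 1), date(2012, 1, 1)], today))
print(expired("presented", [date(2014, 10, 1), date(2014, 10, 9)], today))
```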
Page 38
Ambari 2H 2014
1.7.0 (September)
Features
• Config versioning + history
• Config <final> Properties
• Flume Support
• Ubuntu Support
• ResourceManager HA
• HDFS Rebalance
• Ambari Views Framework
• Slider Support
Tech Preview
• Windows Support
• Ambari Shell
1.8.0 (October)
Features
• ServiceX on YARN via Slider
• Log Access + Search
• Rack Awareness
• Simplified Kerberos Setup
• NameNode SafeMode
• Ambari Shell GA
2.0.0 (December)
Features
• Automated Rolling Upgrades
• Oozie HA
• Ambari Alerts
• Ambari Metrics
• Windows Support GA
Page 40
Efficient Data Lakes can Span to the Cloud
On-Premises
• HDP on Windows, HDP on Linux: Full control of HW and software configs
• Analytics Platform System: Turnkey Hadoop and relational warehouse appliance
Cloud
• HDP on Windows, HDP on Linux: Your deployment of Hadoop hosted as a VM in Azure
• HDInsight: Managed Hadoop Service, built on Azure storage
Enjoy cross-platform interoperability based on 100% open source HDP