Page 1 © Hortonworks Inc. 2014
Discover HDP 2.2: Apache Kafka & Apache Storm for Stream Data Processing
Hortonworks. We do Hadoop.
Speakers
Justin Sears
Hortonworks Product Marketing Manager
Rajiv Onat
Hortonworks Sr. Product Manager for Stream Data Processing
Taylor Goetz
Hortonworks Engineer, Apache Storm Committer & PMC Chair
Agenda
• Introduction to Apache Kafka and Apache Storm
• New Streaming Innovation in HDP 2.2
  § Improved Connectivity
  § Developer Productivity
  § Security Enhancements
• Q & A
We’ll move quickly:
• Attendee phone lines are muted
• Text any questions to Taylor Goetz using Webex chat
• Questions answered at the end
• Unanswered questions will be answered in an upcoming blog post
Big Data, Hadoop & Data Center Re-platforming
Business Drivers
• From reactive analytics to proactive interactions
• Insights that drive competitive advantage & optimal returns
Financial Drivers
• Cost of data systems, as % of IT spend, continues to grow
• Cost advantages of commodity hardware & open source software
Technical Drivers
• Data is growing exponentially & existing systems are overwhelmed
• Predominantly driven by NEW types of data that can inform analytics
There is an inequitable balance between vendor and customer in the market
New Types of Data
Hadoop Value:
• Clickstream: Capture and analyze website visitors’ data trails and optimize your website
• Sensors: Discover patterns in data streaming automatically from remote sensors and machines
• Server Logs: Research logs to diagnose process failures and prevent security breaches
• Sentiment: Understand how your customers feel about your brand and products, right now
• Geographic: Analyze location-based data to manage operations where they occur
• Unstructured: Understand patterns in files across millions of web pages, emails, and documents
A Shift from Reactive to Proactive Interactions
HDP and Hadoop allow organizations to use data to shift interactions from…
Reactive (Post Transaction) to Proactive (Pre Decision)
• A shift in Advertising: from mass branding to 1x1 Targeting
• A shift in Financial Services: from Educated Investing to Automated Algorithms
• A shift in Healthcare: from mass treatment to Designer Medicine
• A shift in Retail: from static branding to Real-time Personalization
• A shift in Telco: from break then fix to repair before break
Enterprise Goals for the Modern Data Architecture
• Consolidate siloed data sets, structured and unstructured
• Central data set on a single cluster
• Multiple workloads across batch, interactive, and real time
• Central services for security, governance and operation
• Preserve existing investment in current tools and platforms
• Single view of the customer, product, supply chain
[Diagram: Modern Data Architecture]
APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
DATA SYSTEM: RDBMS, EDW, MPP, and YARN: Data Operating System (Interactive, Real-Time, Batch) on HDFS (Hadoop Distributed File System)
SOURCES: EXISTING Systems (CRM, ERP, Other), Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured
YARN Transformed Hadoop & Opened a New Era
YARN The Architectural Center of Hadoop
• Common data platform, many applications
• Support multi-tenant access & processing
• Batch, interactive & real-time use cases
[Diagram: YARN: Data Operating System (Cluster Resource Management) running on HDFS (Hadoop Distributed File System)]
BATCH, INTERACTIVE & REAL-TIME DATA ACCESS: Script (Pig), SQL (Hive), Java/Scala (Cascading), Stream (Storm), Search (Solr), NoSQL (HBase, Accumulo), In-Memory (Spark), Others (ISV Engines), with Tez and Slider shown as intermediate layers on YARN
YARN Extends Hadoop to Other Data Center Leaders
YARN The Architectural Center of Hadoop
• Common data platform, many applications
• Support multi-tenant access & processing
• Batch, interactive & real-time use cases
• Supports 3rd-party ISV tools (e.g., SAS, Syncsort, Actian)
YARN Ready Applications: facilitate ongoing innovation and enterprise adoption via an ecosystem of new and existing “YARN Ready” solutions
Enterprise Hadoop: Central Set of Services
Enables Apache Hadoop to be an Enterprise Data Platform with centralized services for:
• Governance: load data and manage according to policy
• Operations: deploy and effectively manage the platform
• Security: provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection
Everything that plugs into Hadoop inherits these services
[Diagram: SECURITY, GOVERNANCE, and OPERATIONS services surround the data access engines on YARN: Data Operating System (Cluster Resource Management) and HDFS (Hadoop Distributed File System)]
OPERATIONS: Provision, Manage & Monitor (Ambari, Zookeeper); Scheduling (Oozie)
BATCH, INTERACTIVE & REAL-TIME DATA ACCESS: Script (Pig), SQL (Hive), Java/Scala (Cascading), Stream (Storm), Search (Solr), NoSQL (HBase, Accumulo), In-Memory (Spark), Others (ISV Engines), with Tez and Slider shown as intermediate layers on YARN
Hortonworks Data Platform 2.2
HDP Delivers Enterprise Hadoop
YARN is the architectural center of HDP
• Common data set across all applications
• Batch, interactive & real-time workloads
• Multi-tenant access & processing
Provides comprehensive enterprise capabilities
• Governance
• Security
• Operations
Enables broad ecosystem adoption
• ISVs can plug directly into Hadoop
The widest range of deployment options
• Linux & Windows
• On premises & cloud
[Diagram: HDP 2.2 components on YARN: Data Operating System (Cluster Resource Management) and HDFS (Hadoop Distributed File System)]
GOVERNANCE (Data Workflow, Lifecycle & Governance): Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS
BATCH, INTERACTIVE & REAL-TIME DATA ACCESS: Script (Pig), SQL (Hive), Java/Scala (Cascading), Stream (Storm), Search (Solr), NoSQL (HBase, Accumulo), In-Memory (Spark), Others (ISV Engines), with Tez and Slider shown as intermediate layers on YARN
SECURITY (Authentication, Authorization, Audit, Data Protection): Storage: HDFS; Resources: YARN; Access: Hive; Pipeline: Falcon; Cluster: Ranger; Cluster: Knox
OPERATIONS: Provision, Manage & Monitor (Ambari, Zookeeper); Scheduling (Oozie)
Deployment Choice: Linux, Windows, Cloud, On-Premises
Introduction to Apache Kafka & Apache Storm
What is Storm?
Open source, real-time event stream processing platform that provides distributed, continuous, & low latency processing for streaming data
• Highly scalable: horizontally scalable like Hadoop
• Fault-tolerant: automatically reassigns tasks on failed nodes
• Guarantees processing: supports at-least-once & exactly-once processing semantics
• Language agnostic: processing logic can be defined in any language
• Apache project: brand, governance & a large active community
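The "guarantees processing" point can be illustrated with a toy model of at-least-once delivery. This is a minimal sketch in plain Python, not Storm's API: `ReplayingSource`, `ack`, `fail`, and `flaky_process` are hypothetical names chosen for illustration. A message stays pending until it is acknowledged, and a failed message is queued again, so it may be processed more than once.

```python
from collections import deque


class ReplayingSource:
    """Toy source that re-emits any message that was not acknowledged."""

    def __init__(self, messages):
        self.queue = deque(enumerate(messages))  # (msg_id, payload) pairs
        self.pending = {}                        # emitted but not yet acked

    def next_message(self):
        if not self.queue:
            return None
        msg_id, payload = self.queue.popleft()
        self.pending[msg_id] = payload
        return msg_id, payload

    def ack(self, msg_id):
        # Processing succeeded: forget the message.
        self.pending.pop(msg_id, None)

    def fail(self, msg_id):
        # Processing failed: queue the message for another attempt.
        self.queue.append((msg_id, self.pending.pop(msg_id)))


def run(source, process):
    """Drive the source until every message has been acknowledged."""
    while True:
        item = source.next_message()
        if item is None:
            break
        msg_id, payload = item
        try:
            process(payload)
            source.ack(msg_id)
        except RuntimeError:
            source.fail(msg_id)


seen = []                      # every payload the processor touched
failures = {"remaining": 1}    # fail exactly once, on payload "b"


def flaky_process(payload):
    seen.append(payload)
    if payload == "b" and failures["remaining"] > 0:
        failures["remaining"] -= 1
        raise RuntimeError("simulated worker failure")


source = ReplayingSource(["a", "b", "c"])
run(source, flaky_process)
print(seen)
```

Because "b" is replayed after its simulated failure, it shows up twice in `seen`. That duplication is what separates at-least-once from exactly-once semantics.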
Storm Concepts
• Tuple: Storm’s data model. A named list of values; fields in a tuple can be of any data type
• Streams: Unbounded sequence of tuples
• Spouts: Source of streams
• Bolts: Perform data processing, transformation, joins, enrichment, and aggregation, and persist data. Can also emit tuples to downstream bolts
• Topology: Processing DAG of spouts and bolts wired together
• Stream groupings: A stream grouping tells a topology how to send tuples between two components
[Figure: A Storm Topology]
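To make these concepts concrete, here is a small word-count sketch in plain Python. It does not use Storm's API; every name in it (`sentence_spout`, `split_bolt`, `fields_grouping`, `run_topology`) is a hypothetical stand-in, and `crc32` is just one convenient stable hash for routing.

```python
from collections import Counter, namedtuple
from zlib import crc32

# A tuple is a named list of values.
SentenceTuple = namedtuple("SentenceTuple", ["sentence"])
WordTuple = namedtuple("WordTuple", ["word"])


def sentence_spout():
    """Spout: the source of a stream (finite here, unbounded in practice)."""
    for text in ["the cow jumped", "the moon", "the cow"]:
        yield SentenceTuple(sentence=text)


def split_bolt(stream):
    """Bolt: turns each sentence tuple into one tuple per word."""
    for tup in stream:
        for word in tup.sentence.split():
            yield WordTuple(word=word)


def fields_grouping(tup, num_tasks):
    """Stream grouping: tuples with the same word go to the same task."""
    return crc32(tup.word.encode("utf-8")) % num_tasks


def run_topology(num_count_tasks=2):
    """Topology: spout -> split bolt -> counting bolt with parallel tasks."""
    task_counts = [Counter() for _ in range(num_count_tasks)]
    for tup in split_bolt(sentence_spout()):
        task = fields_grouping(tup, num_count_tasks)
        task_counts[task][tup.word] += 1
    return task_counts


task_counts = run_topology()
totals = sum(task_counts, Counter())
print(totals)
```

Grouping on the `word` field means each word is counted by exactly one task, so the per-task counters can simply be added together.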
Storm Architecture
Nimbus (Management server)
• Similar to job tracker
• Distributes code around cluster
• Assigns tasks
• Handles failures
Supervisor (slave nodes)
• Similar to task tracker
• Run bolts and spouts as ‘tasks’
Zookeeper
• Cluster co-ordination
• Stores cluster metrics
• Trident State
• Nimbus HA (planned for HDP Dal)
Apache Storm: Stream Processing
[Diagram: Storm and the systems it integrates with]
• Kafka or JMS: Stream data into Storm; stream notifications from Storm
• HDFS: Data lake
• In-memory caching platforms: Temporary data storage
• RDBMS: Provide reference data for Storm topologies
• NoSQL Databases: Real-time views for operational dashboards
• Search Platforms: Search interface for analysts & dashboards
• Any App Development Platform: Simplify development of Storm topologies
What is Kafka? The Basics
High throughput distributed messaging system. Publish-Subscribe semantics, but re-imagined at the implementation level to operate at speed with big data volumes
[Diagram: several producers publish to a Kafka Cluster; several consumers read from it]
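Kafka: Anatomy of a Topic
[Diagram: a topic split into Partition 0, Partition 1, and Partition 2. Each partition is an ordered sequence of messages numbered from 0 (Old) upward (New); the partitions shown differ in length (10 to 13 messages). Writes are appended at the New end.]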
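The partitioned, append-only layout of a topic can be sketched as a toy model in plain Python. This is not Kafka's client API: the `Topic` class, its `append` and `read` methods, and the use of `crc32` to pick a partition from a key are all assumptions made for illustration.

```python
from zlib import crc32


class Topic:
    """Toy topic: a fixed number of append-only partitions."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        """Append to the partition chosen by the key; return (partition, offset)."""
        p = crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1

    def read(self, partition, offset, max_messages=10):
        """Read messages from one partition, starting at the given offset."""
        return self.partitions[partition][offset:offset + max_messages]


topic = Topic(num_partitions=3)
keys = ["user-a", "user-b", "user-a", "user-c", "user-a"]
placements = [topic.append(key, f"event-{i}") for i, key in enumerate(keys)]

for key, (partition, offset) in zip(keys, placements):
    print(key, "-> partition", partition, "offset", offset)
```

Offsets are meaningful only within a single partition. A consumer tracks one offset per partition and can resume reading by calling `read` from the last offset it recorded.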
Kafka: Under the Hood
[Diagram: producers write to, and consumers read from, a Kafka Cluster of three brokers]
• Broker 1: Topic-1 Partition-0
• Broker 2: Topic-1 Partition-1
• Broker 3: Topic-1 Partition-2
• Zookeeper: Stores information about cluster status and consumer offsets
What’s New in HDP Champlain with Storm?
Connectivity
• JMS Connector
• Added HBase “lookup” capability
• Added temporal rotation for files in HDFS
• Hive Streaming Ingest
• Kafka Bolt
* Note: All features ALSO available via Trident APIs
Security
• Authentication
• Authorization via Apache Argus
• Wire-level encryption between Storm processes
Developer Productivity
• Visual Topology Monitoring
• Storm on YARN via Slider
• Standalone – HDFS-less install via Ambari
• Improved REST-based API
• Pluggable Serialization for Multi-lang
New in HDP 2.2: Improved Connectivity
Connectivity Enhancements
Kafka Bolt
• Allows data to be written from a topology (back) to Kafka
• Powerful capability that allows topologies to be interconnected via Kafka topics
JMS Connector
• Supports a number of different JMS providers (testing with ActiveMQ & Oracle JMS)
• Addressed issues with message loss at scale
HBase Lookup
• Capability to look up data from HBase within a Bolt
HDFS Connector with Temporal File Rotation
• Capability to rotate files based on time, rather than on message volume
Hive Streaming Ingest
• Capability to write to Hive without intermediate HDFS writes
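Time-based file rotation can be sketched in a few lines of plain Python. This is an illustration of the idea only, not the HDFS connector's API: the `TimedRotation` class, its methods, and the `write_stream` helper are hypothetical, and timestamps are plain seconds.

```python
class TimedRotation:
    """Rotate when a fixed interval has elapsed since the file was opened."""

    def __init__(self, interval_seconds):
        self.interval = interval_seconds
        self.opened_at = None

    def should_rotate(self, now):
        if self.opened_at is None:
            # First event: the current file is opened now.
            self.opened_at = now
            return False
        return now - self.opened_at >= self.interval

    def reset(self, now):
        self.opened_at = now


def write_stream(events, policy):
    """Group (timestamp, message) events into files according to the policy."""
    files = [[]]
    for ts, msg in events:
        if policy.should_rotate(ts):
            files.append([])   # close the current file, start a new one
            policy.reset(ts)
        files[-1].append(msg)
    return files


events = [(0, "a"), (10, "b"), (59, "c"), (60, "d"), (61, "e"), (130, "f")]
files = write_stream(events, TimedRotation(interval_seconds=60))
print(files)
```

With a 60-second interval, the six events land in three files no matter how many messages arrive in each window. A volume-based policy would instead produce files of equal size but unpredictable time span.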
Connectivity Enhancements: Hive Streaming Ingest
Eliminates intermediate HDFS write and subsequent jobs to load data into Hive
• Requirements: Bucketed tables using ORCFile
• Supports partitioned tables – time can be used as the partition key
• Users can map tuple field names to table column names and also map one or more column names as partition columns.
• Hive 0.14 streaming API comes with Kerberos support, which is implemented as part of the Storm-Hive config.
• Storm-Hive connector writes the tuples in configured batches.
• Writing each tuple immediately would result in an inefficient implementation.
Results: Fewer steps, lower latency…faster access to data!
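The field mapping and batching described above can be sketched as follows. This is a plain-Python model rather than the Storm-Hive connector: `BatchingWriter`, its parameters, and the `sink` callback are hypothetical, and the sink simply records what would have been committed.

```python
class BatchingWriter:
    """Buffer tuples and hand them to a sink in batches, grouped by partition."""

    def __init__(self, column_fields, partition_fields, batch_size, sink):
        self.column_fields = column_fields        # tuple fields -> table columns
        self.partition_fields = partition_fields  # tuple fields -> partition columns
        self.batch_size = batch_size
        self.sink = sink
        self.buffer = {}
        self.count = 0

    def write(self, tup):
        row = tuple(tup[f] for f in self.column_fields)
        partition = tuple(tup[f] for f in self.partition_fields)
        self.buffer.setdefault(partition, []).append(row)
        self.count += 1
        if self.count >= self.batch_size:
            self.flush()

    def flush(self):
        # One sink call per partition instead of one call per tuple.
        for partition, rows in self.buffer.items():
            self.sink(partition, rows)
        self.buffer = {}
        self.count = 0


commits = []
writer = BatchingWriter(
    column_fields=["id", "event"],
    partition_fields=["day"],          # time used as the partition key
    batch_size=2,
    sink=lambda partition, rows: commits.append((partition, rows)),
)

for i, (event, day) in enumerate(
    [("login", "2014-12-01"), ("click", "2014-12-01"), ("click", "2014-12-01"),
     ("login", "2014-12-02"), ("click", "2014-12-02")], start=1):
    writer.write({"id": i, "event": event, "day": day})
writer.flush()  # commit whatever is left in the buffer
print(commits)
```

Five tuples result in four sink calls here because the batch size is only 2. A larger batch size amortizes the per-commit cost over more rows, at the price of slightly higher latency before data becomes visible.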
New in HDP 2.2: Developer Productivity
Monitor Topology Operational Metrics using Storm Topology Viewer
• Spouts appear in Blue
• Bolts appear from Green to Red (based on capacity)
• Line width between Spouts and Bolts represents the flow of tuples relative to the other visible streams.
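As a rough illustration of how a capacity value could drive the green-to-red coloring, the sketch below assumes the definition of bolt capacity used by the Storm UI: tuples executed multiplied by average execute latency, divided by the length of the measurement window. The color thresholds and both function names are hypothetical; the slide does not state the actual cut-offs.

```python
def bolt_capacity(executed, avg_execute_latency_ms, window_ms):
    """Fraction of the window the bolt spent executing tuples (assumed definition)."""
    return executed * avg_execute_latency_ms / window_ms


def capacity_color(capacity):
    """Map capacity to a color. The thresholds are illustrative assumptions."""
    if capacity < 0.5:
        return "green"
    if capacity < 0.9:
        return "yellow"
    return "red"


WINDOW_MS = 10 * 60 * 1000  # a 10-minute window

light = bolt_capacity(executed=120_000, avg_execute_latency_ms=2.0, window_ms=WINDOW_MS)
heavy = bolt_capacity(executed=290_000, avg_execute_latency_ms=2.0, window_ms=WINDOW_MS)

print(round(light, 2), capacity_color(light))
print(round(heavy, 2), capacity_color(heavy))
```

A capacity near 1.0 suggests the bolt is busy for almost the entire window, which is usually a cue to raise its parallelism.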
Storm on YARN via Slider
[Diagram: the Resource Manager (Scheduler) allocates containers on Node Managers. One container runs NIMBUS; Container-1 through Container-N on Node Manager-1 through Node Manager-N run SUPERVISOR-1 through SUPERVISOR-N. Zookeeper-1 through Zookeeper-N sit alongside.]
• Multiple Storm clusters can be run side-by-side.
• Using Slider, one can increase or decrease Storm cluster resources by adding or reducing the number of Supervisors.
• Storm-Slider command to deploy and list topologies (topology operations).
Ambari Based Management and Provisioning
• Centralized provisioning of Storm clusters
• Versioning of Storm configurations
• Manage Storm cluster operations
• Monitor Storm Clusters
New in HDP 2.2: Security Enhancements
Security & Storm
[Diagram: Kerberos server, NIMBUS, SUPERVISOR-1 through SUPERVISOR-N, Storm UI, DRPC, and a Kerberized Zookeeper Cluster (Zookeeper-1 through Zookeeper-N)]
• Nimbus authenticates with Kerberos Server using StormMaster keytab
• Nimbus uses client keytab to communicate with Zookeeper
• Supervisors use client keytab to communicate with Zookeeper
• DRPC authenticates with Kerberos Server using StormMaster keytab
• Storm UI connects to Nimbus via client keytab
• Pluggable Access Control. Ships with Simple ACLController. Argus plugs in via storm.yaml config.
• Any user in a trusted domain with a valid Kerberos token can launch a topology.
Hortonworks Preferred Solution Architecture
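[Diagram: Real-time data feeds enter through Apache Kafka. An HDP 2.x cluster on YARN and HDFS hosts Real Time Stream Processing (Storm), Online Data Processing (HBase, Accumulo), Search (Solr), and SQL (Hive), with Slider shown as a hosting layer. Hive Streaming Ingest and HDFS connect the cluster to the HDP 2.x Data Lake.]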
Q & A
Thank you! Learn more at: hortonworks.com/hadoop/storm/
Register for the remaining Discover HDP 2.2 Webinars: Hortonworks.com/webinars