Page 1: Discover.hdp2.2.storm and kafka.final

Page 1 © Hortonworks Inc. 2014

Discover HDP 2.2: Apache Kafka & Apache Storm for Stream Data Processing

Hortonworks. We do Hadoop.

Page 2: Discover.hdp2.2.storm and kafka.final

Speakers

Justin Sears

Hortonworks Product Marketing Manager

Rajiv Onat

Hortonworks Sr. Product Manager for Stream Data Processing

Taylor Goetz

Hortonworks Engineer, Apache Storm Committer & PMC Chair

Page 3: Discover.hdp2.2.storm and kafka.final

Agenda

•  Introduction to Apache Kafka and Apache Storm

•  New Streaming Innovation in HDP 2.2
§  Improved Connectivity
§  Developer Productivity
§  Security Enhancements

•  Q & A

We’ll move quickly:

•  Attendee phone lines are muted

•  Text any questions to Taylor Goetz using Webex chat

•  Questions answered at the end

•  Unanswered questions and answers in upcoming blog post

Page 4: Discover.hdp2.2.storm and kafka.final

Big Data, Hadoop & Data Center Re-platforming

Business Drivers

•  From reactive analytics to proactive interactions

•  Insights that drive competitive advantage & optimal returns

Financial Drivers

•  Cost of data systems, as % of IT spend, continues to grow

•  Cost advantages of commodity hardware & open source software

Technical Drivers

•  Data is growing exponentially & existing systems overwhelmed

•  Predominantly driven by NEW types of data that can inform analytics

There is an inequitable balance between vendor and customer in the market

Page 5: Discover.hdp2.2.storm and kafka.final

Hadoop Value: New Types of Data

Clickstream: Capture and analyze website visitors’ data trails and optimize your website

Sensors: Discover patterns in data streaming automatically from remote sensors and machines

Server Logs: Research logs to diagnose process failures and prevent security breaches

Sentiment: Understand how your customers feel about your brand and products – right now

Geographic: Analyze location-based data to manage operations where they occur

Unstructured: Understand patterns in files across millions of web pages, emails, and documents

Page 6: Discover.hdp2.2.storm and kafka.final

A Shift from Reactive to Proactive Interactions

HDP and Hadoop allow organizations to use data to shift interactions from…

From Reactive (Post Transaction) …to Proactive (Pre Decision)

A shift in Advertising: From mass branding …to 1x1 Targeting

A shift in Financial Services: From Educated Investing …to Automated Algorithms

A shift in Healthcare: From mass treatment …to Designer Medicine

A shift in Retail: From static branding …to Real-time Personalization

A shift in Telco: From break then fix …to repair before break

Page 7: Discover.hdp2.2.storm and kafka.final

Enterprise Goals for the Modern Data Architecture

•  Consolidate siloed data sets, structured and unstructured

•  Central data set on a single cluster

•  Multiple workloads across batch, interactive and real time

•  Central services for security, governance and operation

•  Preserve existing investment in current tools and platforms

•  Single view of the customer, product, supply chain

[Diagram: Modern Data Architecture. APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications. DATA SYSTEM: RDBMS, EDW and MPP alongside Hadoop (YARN: Data Operating System over HDFS (Hadoop Distributed File System), running Interactive, Real-Time and Batch workloads on nodes 1 to N). SOURCES: EXISTING Systems (CRM, ERP, Other), Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured.]

Page 8: Discover.hdp2.2.storm and kafka.final

YARN Transformed Hadoop & Opened a New Era

YARN The Architectural Center of Hadoop

•  Common data platform, many applications

•  Support multi-tenant access & processing

•  Batch, interactive & real-time use cases

[Diagram: BATCH, INTERACTIVE & REAL-TIME DATA ACCESS engines on YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System). Engines shown: Script (Pig), SQL (Hive), Java, Scala (Cascading), NoSQL (HBase, Accumulo), Stream (Storm), Search (Solr), In-Memory (Spark), Others (ISV Engines), with Tez and Slider layers beneath several engines.]

Page 9: Discover.hdp2.2.storm and kafka.final

YARN Extends Hadoop to Other Data Center Leaders

YARN The Architectural Center of Hadoop

•  Common data platform, many applications

•  Support multi-tenant access & processing

•  Batch, interactive & real-time use cases

•  Supports 3rd-party ISV tools (e.g. SAS, Syncsort, Actian, etc.)

YARN Ready Applications

Facilitates ongoing innovation and enterprise adoption via ecosystem of new and existing “YARN Ready” solutions

[Diagram: BATCH, INTERACTIVE & REAL-TIME DATA ACCESS engines on YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System). Engines shown: Script (Pig), SQL (Hive), Java, Scala (Cascading), NoSQL (HBase, Accumulo), Stream (Storm), Search (Solr), In-Memory (Spark), Others (ISV Engines), with Tez and Slider layers beneath several engines.]

Page 10: Discover.hdp2.2.storm and kafka.final

Enterprise Hadoop: Central Set of Services

Enables Apache Hadoop to be an Enterprise Data Platform with centralized services for:

•  Governance

•  Operations

•  Security

Everything that plugs into Hadoop inherits these services

GOVERNANCE: Load data and manage according to policy

OPERATIONS: Deploy and effectively manage the platform
•  Provision, Manage & Monitor: Ambari, Zookeeper
•  Scheduling: Oozie

SECURITY: Provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection

[Diagram: SECURITY, GOVERNANCE and OPERATIONS services surrounding the BATCH, INTERACTIVE & REAL-TIME DATA ACCESS engines: Script (Pig), SQL (Hive), Java, Scala (Cascading), Stream (Storm), Search (Solr), NoSQL (HBase, Accumulo), In-Memory (Spark), Others (ISV Engines), with Tez and Slider layers, on YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System).]

Page 11: Discover.hdp2.2.storm and kafka.final

Hortonworks Data Platform 2.2

HDP Delivers Enterprise Hadoop

[Diagram: Hortonworks Data Platform 2.2 stack.
GOVERNANCE: Data Workflow, Lifecycle & Governance (Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS).
BATCH, INTERACTIVE & REAL-TIME DATA ACCESS: Script (Pig), SQL (Hive), Java, Scala (Cascading), NoSQL (HBase, Accumulo), Stream (Storm), Search (Solr), In-Memory (Spark), Others (ISV Engines), with Tez and Slider layers, on YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System).
SECURITY: Authentication, Authorization, Audit, Data Protection (Storage: HDFS; Resources: YARN; Access: Hive; Pipeline: Falcon; Cluster: Ranger; Cluster: Knox).
OPERATIONS: Provision, Manage & Monitor (Ambari, Zookeeper); Scheduling (Oozie).
Deployment Choice: Linux, Windows, Cloud, On-Premises.]

YARN is the architectural center of HDP

•  Common data set across all applications

•  Batch, interactive & real-time workloads

•  Multi-tenant access & processing

Provides comprehensive enterprise capabilities

•  Governance

•  Security

•  Operations

Enables broad ecosystem adoption

•  ISVs can plug directly into Hadoop

The widest range of deployment options

•  Linux & Windows

•  On premises & cloud

Page 13: Discover.hdp2.2.storm and kafka.final

Introduction to Apache Kafka & Apache Storm

Page 14: Discover.hdp2.2.storm and kafka.final

What is Storm?

Open source, real-time event stream processing platform that provides distributed, continuous, & low latency processing for streaming data

•  Highly scalable: Horizontally scalable like Hadoop

•  Fault-tolerant: Automatically reassigns tasks on failed nodes

•  Guarantees processing: Supports at least once & exactly once processing semantics

•  Language agnostic: Processing logic can be defined in any language

•  Apache project: Brand, governance & a large active community

Page 15: Discover.hdp2.2.storm and kafka.final

Storm Concepts

•  Tuple: Storm’s data model. A named list of values; fields in a tuple can be of any data type

•  Streams: Unbounded sequence of tuples

•  Spouts: Source of streams

•  Bolts: Perform data processing, transformation, joins, enrichment, aggregation and data persistence. Can also emit tuples to downstream bolts

•  Topology: Processing DAG of spouts and bolts wired together

•  Stream groupings: A stream grouping tells a topology how to send tuples between two components

[Diagram: A Storm Topology]
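The concepts above can be illustrated with a minimal, framework-free sketch. It is plain Python rather than the Storm API: `sentence_spout`, `split_bolt`, `CountBolt`, `fields_grouping` and `run_topology` are hypothetical names chosen for illustration, and the spout is finite only so the example terminates.

```python
import zlib
from collections import Counter

# Hypothetical stand-ins for Storm concepts; none of these names are Storm APIs.


def sentence_spout():
    """Spout: a source of tuples. Real streams are unbounded; this one is finite."""
    for sentence in ["storm and kafka", "kafka and hdfs"]:
        yield {"sentence": sentence}  # a tuple: named fields mapped to values


def split_bolt(tup):
    """Bolt: turns one input tuple into zero or more output tuples."""
    for word in tup["sentence"].split():
        yield {"word": word}


class CountBolt:
    """Bolt task with local state: a running count per word it receives."""

    def __init__(self):
        self.counts = Counter()

    def execute(self, tup):
        self.counts[tup["word"]] += 1


def fields_grouping(tup, field, num_tasks):
    """Stream grouping: tuples with equal field values go to the same task."""
    return zlib.crc32(tup[field].encode("utf-8")) % num_tasks


def run_topology(num_count_tasks=2):
    """Topology: spout -> split bolt -> fields grouping on 'word' -> count bolts."""
    count_tasks = [CountBolt() for _ in range(num_count_tasks)]
    for sentence_tuple in sentence_spout():
        for word_tuple in split_bolt(sentence_tuple):
            task = fields_grouping(word_tuple, "word", num_count_tasks)
            count_tasks[task].execute(word_tuple)
    return count_tasks


tasks = run_topology()
totals = sum((t.counts for t in tasks), Counter())
```

Because the grouping hashes the `word` field, every occurrence of a given word lands on the same counting task, which is what lets each task keep a correct local count.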

Page 16: Discover.hdp2.2.storm and kafka.final

Storm Architecture

Nimbus (Management server)
•  Similar to job tracker
•  Distributes code around cluster
•  Assigns tasks
•  Handles failures

Supervisor (slave nodes)
•  Similar to task tracker
•  Run bolts and spouts as ‘tasks’

Zookeeper
•  Cluster co-ordination
•  Stores cluster metrics
•  Trident State
•  Nimbus HA (planned for HDP Dal)

Page 17: Discover.hdp2.2.storm and kafka.final

Apache Storm: Stream Processing

•  Kafka or JMS: Stream data into Storm; stream notifications from Storm

•  HDFS: Data lake

•  In-memory caching platforms: Temporary data storage

•  RDBMS: Provide reference data for Storm topologies

•  NoSQL Databases: Real-time views for operational dashboards

•  Search Platforms: Search interface for analysts & dashboards

•  Any App Development Platform: Simplify development of Storm topologies

Page 18: Discover.hdp2.2.storm and kafka.final

What is Kafka? The Basics

•  High throughput distributed messaging system

•  Publish-Subscribe semantics but re-imagined at the implementation level to operate at speed with big data volumes

[Diagram: several producers publish to the Kafka Cluster; several consumers read from it.]

Page 19: Discover.hdp2.2.storm and kafka.final

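Kafka: Anatomy of a Topic

[Diagram: a topic split into Partition 0, Partition 1 and Partition 2. Each partition is an ordered sequence of messages numbered from 0 (Old) upward (New); Writes are appended at the New end. The partitions in the figure hold different numbers of messages, between 10 and 13.]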
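A toy model can make the partition and offset idea concrete. This is a sketch in plain Python, not the Kafka client API: the `Topic` class is hypothetical, and choosing the partition by hashing a message key is a simplifying assumption for the example.

```python
import zlib


class Topic:
    """Toy model of a topic: a fixed set of append-only partition logs."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        """Append to the partition picked by hashing the key.

        Returns (partition, offset); the offset is the message's position
        in that partition's log, starting at 0.
        """
        p = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1

    def read(self, partition, offset, max_messages=10):
        """Read forward from an offset; reading never removes messages."""
        return self.partitions[partition][offset:offset + max_messages]


topic = Topic(num_partitions=3)
first = topic.append("sensor-1", "t=20C")
second = topic.append("sensor-1", "t=21C")
```

Both writes share a key, so they land in the same partition with consecutive offsets. A consumer that remembers the last offset it read can resume from that point, and re-reading from offset 0 returns the same messages again.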

Page 20: Discover.hdp2.2.storm and kafka.final

Kafka: Under the Hood

[Diagram: a Kafka Cluster of three brokers. Broker 1 holds Topic-1 Partition-0, Broker 2 holds Topic-1 Partition-1, Broker 3 holds Topic-1 Partition-2. Producers write to the cluster and consumers read from it.]

Zookeeper: Stores information about cluster status and consumer offsets

Page 21: Discover.hdp2.2.storm and kafka.final

What’s New in HDP Champlain with Storm?

Connectivity
•  JMS Connector
•  Added HBase “lookup” capability
•  Added temporal rotation for files in HDFS
•  Hive Streaming Ingest
•  Kafka Bolt
* Note: All features ALSO available via Trident APIs

Security
•  Authentication
•  Authorization via Apache Argus
•  Wire-level encryption between Storm processes

Developer Productivity
•  Visual Topology Monitoring
•  Storm on YARN via Slider
•  Standalone – HDFS-less install via Ambari
•  Improved REST-based API
•  Pluggable Serialization for Multi-lang

Page 22: Discover.hdp2.2.storm and kafka.final

New in HDP 2.2: Improved Connectivity

Page 23: Discover.hdp2.2.storm and kafka.final

Connectivity Enhancements

Kafka Bolt
•  Allows for data to be written from a topology (back) to Kafka
•  Powerful capability which allows for topologies to be interconnected via Kafka Topics

JMS Connector
•  Supports a number of different JMS providers (Testing with ActiveMQ & Oracle JMS)
•  Addressed issues with message loss at scale

HBase Lookup
•  Capability to look up data from HBase within a Bolt

HDFS Connector with Temporal file rotation
•  Capability to rotate files based on time, rather than on message volume

Hive Streaming Ingest
•  Capability to write to Hive without intermediate HDFS writes
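Temporal file rotation can be sketched as follows. This is an illustration in plain Python, not the Storm HDFS connector itself: `TimeBasedRotation` and `assign_files` are hypothetical names, and timestamps are plain seconds supplied by the caller.

```python
class TimeBasedRotation:
    """Signal a rotation once `interval_seconds` have passed, whatever the volume."""

    def __init__(self, interval_seconds, start_time):
        self.interval = interval_seconds
        self.last_rotation = start_time

    def should_rotate(self, now):
        if now - self.last_rotation >= self.interval:
            self.last_rotation = now
            return True
        return False


def assign_files(events, policy):
    """Group (timestamp, message) events into files, opening a new file on rotation."""
    files = [[]]
    for timestamp, message in events:
        if policy.should_rotate(timestamp):
            files.append([])
        files[-1].append(message)
    return files


events = [(0, "a"), (10, "b"), (59, "c"), (60, "d"), (130, "e")]
files = assign_files(events, TimeBasedRotation(interval_seconds=60, start_time=0))
```

With a 60 second interval the first file closes after three messages and the next after one, so file boundaries follow the clock and not the message count.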

Page 24: Discover.hdp2.2.storm and kafka.final

Connectivity Enhancements: Hive Streaming Ingest

Eliminates intermediate HDFS write and subsequent jobs to load data into Hive

•  Requirements: Bucketed tables using ORCFile

•  Supports partitioned tables – time can be used as the partition key

•  Users can map tuple field names to table column names and also map one or more column names as partition columns.

•  Hive 0.14 streaming API comes with Kerberos support, which is implemented as part of Storm-Hive config.

•  Storm-Hive connector writes the tuples in configured batches.

•  Writing each tuple immediately would result in an inefficient implementation.

Results: Fewer steps, lower latency…faster access to data!
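The batching idea can be sketched in a few lines. This is a generic illustration in plain Python, not the Storm-Hive connector: `BatchingWriter` and its `flush_fn` callback are hypothetical, and the callback stands in for whatever commits a batch to the table.

```python
class BatchingWriter:
    """Buffer tuples and hand them to `flush_fn` in batches of `batch_size`."""

    def __init__(self, batch_size, flush_fn):
        self.batch_size = batch_size
        self.flush_fn = flush_fn
        self.buffer = []

    def write(self, tup):
        self.buffer.append(tup)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Commit whatever is buffered; a no-op when the buffer is empty."""
        if self.buffer:
            self.flush_fn(list(self.buffer))
            self.buffer.clear()


committed = []
writer = BatchingWriter(batch_size=3, flush_fn=committed.append)
for i in range(7):
    writer.write({"id": i})
writer.flush()  # commit the final partial batch
```

Seven tuples result in three commits instead of seven, which is the trade-off the slide describes: fewer, larger writes at the cost of tuples waiting briefly in the buffer.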

Page 25: Discover.hdp2.2.storm and kafka.final

New in HDP 2.2: Developer Productivity

Page 26: Discover.hdp2.2.storm and kafka.final

Monitor Topology Operational Metrics using Storm Topology Viewer

•  Spouts appear in Blue

•  Bolts appear from Green to Red (based on capacity)

•  Line width between Spouts and Bolts represents the flow of tuples relative to the other visible streams.
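A rough sketch of how a capacity value could drive the green-to-red colouring. The capacity formula (tuples executed multiplied by average execute latency, divided by the measurement window) follows the description given in the Storm UI; the linear colour mapping and the function names `bolt_capacity` and `capacity_color` are assumptions made for illustration, not the viewer's actual code.

```python
def bolt_capacity(executed, avg_execute_latency_ms, window_ms):
    """Fraction of the window the bolt spent executing tuples."""
    return executed * avg_execute_latency_ms / window_ms


def capacity_color(capacity):
    """Map capacity to an (r, g, b) colour from green (idle) to red (saturated).

    The linear blend is a hypothetical choice; values are clamped to [0, 1].
    """
    c = max(0.0, min(1.0, capacity))
    return (round(255 * c), round(255 * (1 - c)), 0)


window_ms = 600_000  # a 10 minute window
idle = capacity_color(bolt_capacity(0, 2.0, window_ms))
busy = capacity_color(bolt_capacity(300_000, 2.0, window_ms))
```

A bolt whose capacity approaches 1.0 is busy for nearly the whole window, which is usually a hint to raise its parallelism.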

Page 27: Discover.hdp2.2.storm and kafka.final

Storm on YARN via Slider

[Diagram: the YARN Resource Manager (with its Scheduler) and Node Managers. NIMBUS runs in a Container on one Node Manager; SUPERVISOR-1 through SUPERVISOR-N run in Container-1 through Container-N on Node Manager-1 through Node Manager-N. A Zookeeper ensemble (Zookeeper-1, Zookeeper-2, … Zookeeper-N) sits alongside.]

•  Multiple Storm clusters can be run side-by-side.

•  Using Slider one can increase or decrease Storm cluster resources.
–  Adding or reducing the number of Supervisors.

•  Storm-Slider command to deploy, list (topology operations).

Page 28: Discover.hdp2.2.storm and kafka.final

Ambari Based Management and Provisioning

•  Centralized provisioning of Storm clusters

•  Versioning of Storm configurations

•  Manage Storm cluster operations

•  Monitor Storm Clusters

Page 29: Discover.hdp2.2.storm and kafka.final

New in HDP 2.2: Security Enhancements

Page 30: Discover.hdp2.2.storm and kafka.final

Security & Storm

[Diagram: NIMBUS, SUPERVISOR-1 … SUPERVISOR-N, Storm UI and DRPC around a Kerberized Zookeeper Cluster (Zookeeper-1, Zookeeper-2, … Zookeeper-N) and a Kerberos server.]

•  Nimbus authenticates with Kerberos Server using StormMaster keytab

•  Supervisors use client keytab to communicate with Zookeeper

•  Nimbus uses client keytab to communicate with Zookeeper

•  Authenticates with Kerberos Server using StormMaster keytab

•  Storm UI connects to Nimbus via client keytab

•  Pluggable Access Control. Ships with Simple ACLController. Argus plugs in via storm.yaml config.

•  Any user in a trusted domain with a valid Kerberos token can launch a topology.

Page 31: Discover.hdp2.2.storm and kafka.final

Hortonworks Preferred Solution Architecture

[Diagram: Real-time data feeds enter through APACHE KAFKA into the HDP 2.x Data Lake. On YARN over HDFS: Real Time Stream Processing (Storm), Online Data Processing (HBase, Accumulo), Search (Solr), SQL (Hive, fed by Streaming Ingest), with Slider.]

Page 32: Discover.hdp2.2.storm and kafka.final

Q & A

Page 33: Discover.hdp2.2.storm and kafka.final

Thank you!

Learn more at: hortonworks.com/hadoop/storm/

Register for the remaining Discover HDP 2.2 Webinars: Hortonworks.com/webinars