Publish Subscribe based Hadoop Distributed Notification System
A Publish-Subscribe Distributed Notification System on Hadoop
Jyotiska Nath Khasnabish
IIIT-Bangalore
Hadoop
Open source distributed framework for processing “Big Data”.
Offers a distributed file system (HDFS) for storing massive amounts of data across clusters.
Uses MapReduce as the programming model for processing large amounts of data.
Adopted and used in production by 1000+ companies worldwide.
20+ popular Hadoop-based subprojects and growing.
Distributed Notification System
[HDFS-1742] talks about a system that could notify interested clients about major HDFS events (file creation, deletion, etc.) and about MapReduce job completion.
[HDFS-2760] talks about adding a PubSub system on HDFS for sending notification messages to clients subscribed to specific services.
[HDFS-7821] talks about an event notification system which would:
  Provide periodic updates to subscribed users.
  Let users specify 'interesting events'.
  Provide a 'customizable' and 'configurable' interface so that user-defined parameters can also be 'subscribed' to by the user.
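The idea of letting users specify 'interesting events' can be illustrated with a small in-memory sketch. It is not taken from any of the HDFS tickets above; the names `HdfsEvent`, `EventNotifier`, `subscribe` and `emit` are hypothetical, and a real system would deliver events over the network rather than into a local list.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class HdfsEvent:
    """Hypothetical event record; the field names are illustrative only."""
    kind: str  # e.g. "create" or "delete"
    path: str


class EventNotifier:
    """Keeps one predicate per subscriber and delivers only matching events."""

    def __init__(self) -> None:
        self._filters: Dict[str, Callable[[HdfsEvent], bool]] = {}
        self.inbox: Dict[str, List[HdfsEvent]] = {}

    def subscribe(self, user: str, is_interesting: Callable[[HdfsEvent], bool]) -> None:
        # The predicate is the user's own definition of an "interesting event".
        self._filters[user] = is_interesting
        self.inbox[user] = []

    def emit(self, event: HdfsEvent) -> None:
        # Deliver the event to every subscriber whose predicate accepts it.
        for user, is_interesting in self._filters.items():
            if is_interesting(event):
                self.inbox[user].append(event)


notifier = EventNotifier()
notifier.subscribe("alice", lambda e: e.kind == "delete")
notifier.subscribe("bob", lambda e: e.path.startswith("/logs/"))

notifier.emit(HdfsEvent("create", "/logs/day1"))
notifier.emit(HdfsEvent("delete", "/tmp/scratch"))

print(notifier.inbox["alice"])
print(notifier.inbox["bob"])
```

Here "alice" only sees the deletion and "bob" only sees the event under `/logs/`, so neither receives events they did not ask for.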
Publish Subscribe Model
Messaging Systems
Apache ActiveMQ
Uses JMS (Java Message Service) for sending and receiving messages.
Three components – Publisher, Broker, Subscriber.
Supports both persistent and non-persistent messaging.
Apache Kafka
Developed by LinkedIn.
Three components – Producer, Broker, Consumer.
Supports both Persistent and Non Persistent Messaging.
Uses ZooKeeper for coordination.
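The three roles that both systems share (publisher or producer, broker, subscriber or consumer) can be sketched in memory as follows. This is only an illustration of the model, not the ActiveMQ or Kafka API: the `Broker` class and its `retain` and `replay` parameters are hypothetical, and retaining messages in a Python list is a loose stand-in for persistent messaging, not a description of how either system stores messages.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class Broker:
    """Routes messages from publishers to subscribers by topic name."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)
        self._retained: DefaultDict[str, List[str]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[str], None], replay: bool = False) -> None:
        # Optionally hand the new subscriber every retained message first.
        if replay:
            for message in self._retained[topic]:
                callback(message)
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: str, retain: bool = False) -> None:
        # Retained messages stay in the broker for subscribers that join later;
        # other messages reach only the subscribers present at publish time.
        if retain:
            self._retained[topic].append(message)
        for callback in self._subscribers[topic]:
            callback(message)


broker = Broker()
early: List[str] = []
late: List[str] = []

broker.subscribe("jobs", early.append)
broker.publish("jobs", "job-1 started", retain=True)
broker.publish("jobs", "job-1 at 50%")
broker.subscribe("jobs", late.append, replay=True)
broker.publish("jobs", "job-1 finished", retain=True)

print(early)
print(late)
```

The publisher never refers to a subscriber directly; it only names a topic. The late subscriber misses the non-retained progress message but still receives the two retained ones.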
Architecture
Use Cases
1. Message Passing
Sending status flags or progress reports of running jobs among multiple Hadoop services.
Hadoop services can take the role of either a publisher or a subscriber.
Example: TaskTrackers notify the JobTracker of their status only when there is a status change.
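The TaskTracker example can be sketched as a reporter that remembers the last status it sent and stays silent until the status changes. The class name `StatusReporter` and the `publish` callback are hypothetical and stand in for whatever messaging client is used; this is not Hadoop code.

```python
from typing import Callable, List, Optional, Tuple


class StatusReporter:
    """Publishes a tracker's status only when it differs from the last one sent."""

    def __init__(self, tracker_id: str, publish: Callable[[str, str], None]) -> None:
        self._tracker_id = tracker_id
        self._publish = publish
        self._last_sent: Optional[str] = None

    def report(self, status: str) -> bool:
        """Return True if a message was published, False if it was suppressed."""
        if status == self._last_sent:
            return False
        self._last_sent = status
        self._publish(self._tracker_id, status)
        return True


received: List[Tuple[str, str]] = []
reporter = StatusReporter("tracker-1", lambda tid, status: received.append((tid, status)))

observed = ["idle", "idle", "running", "running", "running", "idle"]
for status in observed:
    reporter.report(status)

print(len(observed), "observations,", len(received), "messages")
```

In this run six observed states produce three messages, one per actual change.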
2. Notification for Data Availability
Chained jobs get notified about the completion of some other job on which they are dependent.
No need to poll the NameNode for data availability in HDFS.
Multiple subscribed services or jobs can be notified when the data is available.
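A minimal sketch of this use case, assuming a hypothetical `DataAvailabilityTopic` that the producing job calls once its output is ready: dependent jobs register a callback for a path instead of asking repeatedly whether the path exists. The paths and job names are made up for illustration.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List, Set, Tuple


class DataAvailabilityTopic:
    """Notifies every waiting subscriber once a path is marked available."""

    def __init__(self) -> None:
        self._waiting: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)
        self._available: Set[str] = set()

    def when_available(self, path: str, callback: Callable[[str], None]) -> None:
        # If the data is already there, notify at once; otherwise wait.
        if path in self._available:
            callback(path)
        else:
            self._waiting[path].append(callback)

    def mark_available(self, path: str) -> None:
        # Called by the producing job; wakes all subscribers for this path.
        self._available.add(path)
        for callback in self._waiting.pop(path, []):
            callback(path)


topic = DataAvailabilityTopic()
log: List[Tuple[str, str]] = []

topic.when_available("/data/out", lambda p: log.append(("report-job", p)))
topic.when_available("/data/out", lambda p: log.append(("export-job", p)))
pending_before = list(log)  # nothing has been delivered yet

topic.mark_available("/data/out")
topic.when_available("/data/out", lambda p: log.append(("late-job", p)))

print(log)
```

Both waiting jobs are notified by the single `mark_available` call, and a job that subscribes afterwards is notified immediately, so none of them needs a polling loop.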
3. Event Based Job Chaining
Multiple MapReduce jobs can be chained based on events occurring in the Hadoop cluster.
Easier for workflow managers to chain jobs and trigger workflows automatically.
Automatic setting of job dependency for heavily chained MapReduce jobs in order to accomplish a complex computation.
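Event-based chaining can be sketched as a small scheduler that starts a job as soon as the completion events of all its dependencies have arrived. The `JobChain` class is hypothetical, jobs are plain Python callables rather than MapReduce jobs, and completion events are delivered synchronously to keep the example short.

```python
from typing import Callable, Dict, Iterable, List, Set


class JobChain:
    """Starts each job when the 'finished' events of all its dependencies arrive."""

    def __init__(self) -> None:
        self._pending: Dict[str, Set[str]] = {}  # job -> unfinished dependencies
        self._actions: Dict[str, Callable[[], None]] = {}

    def add_job(self, name: str, action: Callable[[], None],
                depends_on: Iterable[str] = ()) -> None:
        self._pending[name] = set(depends_on)
        self._actions[name] = action

    def start(self) -> None:
        # Jobs with no dependencies need no event and can run right away.
        for name in [n for n, deps in self._pending.items() if not deps]:
            self._run(name)

    def _run(self, name: str) -> None:
        del self._pending[name]
        self._actions[name]()
        self.on_job_finished(name)  # stands in for a published completion event

    def on_job_finished(self, finished: str) -> None:
        # Collect jobs whose last open dependency was the finished job.
        ready: List[str] = []
        for name, deps in self._pending.items():
            if finished in deps:
                deps.discard(finished)
                if not deps:
                    ready.append(name)
        for name in ready:
            self._run(name)


order: List[str] = []
chain = JobChain()
chain.add_job("report", lambda: order.append("report"), depends_on=["clean", "stats"])
chain.add_job("stats", lambda: order.append("stats"), depends_on=["clean"])
chain.add_job("clean", lambda: order.append("clean"), depends_on=["extract"])
chain.add_job("extract", lambda: order.append("extract"))
chain.start()

print(order)
```

The jobs are registered in reverse order on purpose: the execution order follows from the completion events, not from the order of registration.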
Cluster Configuration
                    Machine #1      Machine #2      Machine #3
Processing Speed    2.3 GHz         2.3 GHz         2.3 GHz
RAM                 2 GB            2 GB            2 GB
Disk Space          8 GB            8 GB            8 GB
OS                  Ubuntu 12.04    Ubuntu 12.04    Ubuntu 12.04
Hadoop Version      1.1.1           1.1.1           1.1.1
ActiveMQ Version    5.8.0           5.8.0           5.8.0
Kafka Version       0.8             0.8             0.8
Performance Analysis: ActiveMQ vs Kafka
Performance Analysis: Single Node vs Multi Node
Performance Comparison: With and Without Notification System
[Figure: Hadoop cluster load, before and after]
[Figure: Network bandwidth consumption, before and after]
Mobile Client
Conclusion
Distributed notification system based on Publish Subscribe messaging model.
Can be used to pass messages between services, notify subscribed clients and chain multiple jobs.
Reduces cluster load and network bandwidth consumption significantly, resulting in better use of hardware and resources.
Can be scaled to large Hadoop clusters (more than 100 or 1000 nodes) for handling heavily interdependent jobs.
Thank you