8/14/2019 4.Flume
http://slidepdf.com/reader/full/4flume 1/19
Arinto Murdopo, Josep Subirats
Group 4, EEDC 2012
Outline
● Current problem
● What is Apache Flume?
● The Flume Model
○ Flows and Nodes
○ Agent, Processor and Collector Nodes
○ Data and Control Path
● Flume goals
○ Reliability
○ Scalability
○ Extensibility
○ Manageability
● Use case: Near Realtime Aggregator
Current Problem
● Situation: You have hundreds of services running on different servers that produce lots of large logs, which should be analyzed together. You have Hadoop to process them.
● Problem: How do I send all my logs to a place that has Hadoop? I need a reliable, scalable, extensible and manageable way to do it!
What is Apache Flume?
● It is a distributed data collection service that gets flows of data (like logs) from their source and aggregates them to where they have to be processed.
● Goals: reliability, scalability, extensibility, manageability.
Exactly what I needed!
The Flume Model: Flows and Nodes
● A flow corresponds to a type of data source (server logs, machine monitoring metrics...).
● Flows are composed of nodes chained together (see slide 7).
http://slidepdf.com/reader/full/4flume 7/19
The Flume Model: Agent, Processor and Collector Nodes
● Agent: receives data from an application.
● Processor (optional): intermediate processing.
● Collector: writes data to permanent storage.
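The three node roles above can be sketched as a small chain. This is only an illustration in Python, not Flume's own API: the names `agent`, `processor`, `collector` and `run_flow` are hypothetical, and a plain list stands in for permanent storage.

```python
from typing import Callable, Iterable, Iterator, List, Optional

# All names here are hypothetical; this is not Flume's API.
Event = str


def agent(app_output: Iterable[str]) -> Iterator[Event]:
    """Agent: receives raw data from an application and emits events."""
    for line in app_output:
        yield line.rstrip("\n")


def processor(events: Iterable[Event],
              keep: Callable[[Event], bool]) -> Iterator[Event]:
    """Processor (optional): intermediate processing, here a simple filter."""
    return (e for e in events if keep(e))


def collector(events: Iterable[Event], storage: List[Event]) -> int:
    """Collector: writes events to storage (a list stands in for it)."""
    count = 0
    for e in events:
        storage.append(e)
        count += 1
    return count


def run_flow(app_output: Iterable[str],
             storage: List[Event],
             keep: Optional[Callable[[Event], bool]] = None) -> int:
    """Chain agent -> (optional) processor -> collector."""
    events = agent(app_output)
    if keep is not None:
        events = processor(events, keep)
    return collector(events, storage)
```

Calling `run_flow(lines, storage)` without `keep` skips the processor stage, which mirrors the processor being optional.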
The Flume Model: Data and Control Path (1/2)
Nodes are in the data path.
The Flume Model: Data and Control Path (2/2)
Masters are in the control path.
● Centralized point of configuration. Multiple masters: coordinated via ZooKeeper (ZK).
● Specify sources, sinks and control data flows.
Flume Goals: Reliability
Tunable Failure Recovery Modes
● Best Effort
● Store on Failure and Retry
● End to End Reliability
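The difference between the first two modes can be sketched as follows. This is a minimal Python illustration under stated assumptions, not Flume's implementation: `FlakySink`, `send_best_effort` and `StoreOnFailureSender` are hypothetical names, and an in-memory queue stands in for a local buffer. End-to-end reliability would additionally need an acknowledgement from the final destination, which is not modeled here.

```python
from collections import deque
from typing import Deque, List


class FlakySink:
    """Hypothetical downstream node that fails while `up` is False."""

    def __init__(self) -> None:
        self.up = True
        self.received: List[str] = []

    def send(self, event: str) -> None:
        if not self.up:
            raise ConnectionError("downstream unavailable")
        self.received.append(event)


def send_best_effort(sink: FlakySink, event: str) -> bool:
    """Best effort: try once; on failure the event is simply dropped."""
    try:
        sink.send(event)
        return True
    except ConnectionError:
        return False


class StoreOnFailureSender:
    """Store on failure and retry: failed events are buffered and resent."""

    def __init__(self, sink: FlakySink) -> None:
        self.sink = sink
        # Assumption: an in-memory queue stands in for a local buffer.
        self.backlog: Deque[str] = deque()

    def send(self, event: str) -> None:
        try:
            self.sink.send(event)
        except ConnectionError:
            self.backlog.append(event)

    def retry(self) -> int:
        """Resend buffered events in order; stop at the first failure."""
        resent = 0
        while self.backlog:
            try:
                self.sink.send(self.backlog[0])
            except ConnectionError:
                break
            self.backlog.popleft()
            resent += 1
        return resent
```

The trade-off is visible in the sketch: best effort costs nothing extra but loses events during an outage, while store-on-failure keeps them at the price of local buffering.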
Flume Goals: Scalability
Horizontally Scalable Data Path
Load Balancing
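One way to picture load balancing on the data path is spreading agents over the available collectors. The slide does not say which policy Flume uses; the sketch below assumes plain round-robin, and `assign_round_robin` and the node names are hypothetical.

```python
from itertools import cycle
from typing import Dict, List


def assign_round_robin(agents: List[str], collectors: List[str]) -> Dict[str, str]:
    """Map each agent to a collector, cycling through the collectors in order.

    Assumption: round-robin is used purely for illustration.
    """
    if not collectors:
        raise ValueError("at least one collector is required")
    rotation = cycle(collectors)
    return {agent: next(rotation) for agent in agents}
```

Under this policy, adding a collector to the list lowers the number of agents each one serves, which is the sense in which the data path scales horizontally.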
Flume Goals: Scalability
Horizontally Scalable Control Path
Flume Goals: Extensibility
● Simple Source and Sink API
○ Event streaming and composition of simple operations
● Plug-in Architecture
○ Add your own sources, sinks, decorators
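The idea of composing simple operations from sources, sinks and decorators can be sketched as below. This is a Python illustration only and does not reproduce Flume's actual plug-in interfaces; every class and function name here is hypothetical.

```python
from typing import Callable, Iterable, Iterator, List


class ListSource:
    """Hypothetical source: yields events from an in-memory list."""

    def __init__(self, events: Iterable[str]) -> None:
        self._events = list(events)

    def __iter__(self) -> Iterator[str]:
        return iter(self._events)


class ListSink:
    """Hypothetical sink: stores every event it receives."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def append(self, event: str) -> None:
        self.events.append(event)


class UppercaseDecorator:
    """Hypothetical decorator: transforms an event, then forwards it."""

    def __init__(self, inner) -> None:
        self.inner = inner

    def append(self, event: str) -> None:
        self.inner.append(event.upper())


class FilterDecorator:
    """Hypothetical decorator: forwards only events matching a predicate."""

    def __init__(self, inner, predicate: Callable[[str], bool]) -> None:
        self.inner = inner
        self.predicate = predicate

    def append(self, event: str) -> None:
        if self.predicate(event):
            self.inner.append(event)


def pump(source: Iterable[str], sink) -> None:
    """Stream every event from the source into the (decorated) sink."""
    for event in source:
        sink.append(event)
```

Because each decorator exposes the same `append` method as a sink, decorators can be stacked in any order around a sink without the source knowing about them.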
Flume Goals: Manageability
Centralized Data Flow Management Interface
Flume Goals: Manageability
Configuring Flume
Node: tail("file") | filter [ console, roll(1000) { dfs("hdfs://namenode/user/flume") } ];
Output Bucketing
/logs/web/2010/0715/1200/data-xxx.txt
/logs/web/2010/0715/1200/data-xxy.txt
/logs/web/2010/0715/1300/data-xxx.txt
/logs/web/2010/0715/1300/data-xxy.txt
/logs/web/2010/0715/1400/data-xxx.txt
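The paths above look like year / month-day / hour buckets. Assuming that reading, a bucket path can be derived from an event timestamp as in this Python sketch; `bucket_path` is a hypothetical helper, not a Flume function, and the `xxx` suffix is the placeholder used on the slide.

```python
from datetime import datetime


def bucket_path(ts: datetime, suffix: str, root: str = "/logs/web") -> str:
    """Build an hourly bucket path from a timestamp.

    Assumption: buckets are hourly (YYYY/MMDD/HH00), inferred from the
    example paths; minutes and seconds are ignored.
    """
    return f"{root}/{ts:%Y}/{ts:%m%d}/{ts:%H}00/data-{suffix}.txt"
```

For example, `bucket_path(datetime(2010, 7, 15, 12, 34), "xxx")` gives `/logs/web/2010/0715/1200/data-xxx.txt`, matching the first path in the list.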
Conclusion
Flume is
● A distributed data collection service
● Suitable for enterprise settings
● Intended for cases with large amounts of log data to process