Arinto Murdopo Josep Subirats Group 4 EEDC 2012

4.Flume

8/14/2019 4.Flume

http://slidepdf.com/reader/full/4flume 1/19

Outline

● Current problem
● What is Apache Flume?
● The Flume Model
  ○ Flows and Nodes
  ○ Agent, Processor and Collector Nodes
  ○ Data and Control Path
● Flume goals
  ○ Reliability
  ○ Scalability
  ○ Extensibility
  ○ Manageability
● Use case: Near Realtime Aggregator

Current Problem

● Situation: You have hundreds of services running on different servers that produce lots of large logs, which should be analyzed together. You have Hadoop to process them.

● Problem: How do I send all my logs to a place that has Hadoop? I need a reliable, scalable, extensible and manageable way to do it!

What is Apache Flume?

● It is a distributed data collection service that gets flows of data (like logs) from their source and aggregates them where they have to be processed.

● Goals: reliability, scalability, extensibility, manageability.

Exactly what I needed!

The Flume Model: Flows and Nodes

● A flow corresponds to a type of data source (server logs, machine monitoring metrics...).

● Flows are composed of nodes chained together (see slide 7, on agent, processor and collector nodes).

The Flume Model: Agent, Processor and Collector Nodes

● Agent: receives data from an application.

● Processor (optional): intermediate processing.

● Collector: writes data to permanent storage.
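
The three node roles above can be pictured as stages chained into one flow. The following is a minimal sketch in Python, not Flume code: the function names, the one-line-per-event `Event` type and the in-memory list standing in for permanent storage are all assumptions made for illustration.

```python
from typing import Callable, Iterable, Iterator, List

Event = str  # assumption: one log line per event


def agent(lines: Iterable[str]) -> Iterator[Event]:
    """Agent: receives raw data from an application and emits events."""
    for line in lines:
        yield line.rstrip("\n")


def processor(events: Iterable[Event], keep: Callable[[Event], bool]) -> Iterator[Event]:
    """Processor (optional): intermediate processing, here a simple filter."""
    for event in events:
        if keep(event):
            yield event


def collector(events: Iterable[Event], storage: List[Event]) -> int:
    """Collector: writes events to storage (a list stands in for it here)."""
    count = 0
    for event in events:
        storage.append(event)
        count += 1
    return count


# Hypothetical application output, chained agent -> processor -> collector.
app_output = ["INFO started\n", "DEBUG noise\n", "ERROR disk full\n"]
stored: List[Event] = []
written = collector(
    processor(agent(app_output), lambda e: not e.startswith("DEBUG")),
    stored,
)
```

Leaving out the `processor` stage and passing the agent's output straight to `collector` gives the shorter chain in which the processor is skipped.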

The Flume Model: Data and Control Path (1/2)

Nodes are in the data path.

The Flume Model: Data and Control Path (2/2)

Masters are in the control path.

● Centralized point of configuration. Multiple masters coordinate through ZooKeeper (ZK).
● Specify sources and sinks, and control data flows.
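
To make the split between control path and data path concrete, here is a minimal Python sketch of a master that only hands out configuration. It is an illustration under assumptions, not the Flume master API: the `Master` and `NodeConfig` classes and the node name `web01` are hypothetical, and the source and sink specs are kept as opaque strings.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class NodeConfig:
    source: str  # where the node reads events from
    sink: str    # where the node writes events to


class Master:
    """Control path only: stores per-node configuration, never touches events."""

    def __init__(self) -> None:
        self._configs: Dict[str, NodeConfig] = {}

    def configure(self, node: str, source: str, sink: str) -> None:
        # One central place to set or change what a node does.
        self._configs[node] = NodeConfig(source, sink)

    def config_for(self, node: str) -> Optional[NodeConfig]:
        # A node asks the master for its current configuration.
        return self._configs.get(node)


master = Master()
master.configure("web01", 'tail("file")', 'dfs("hdfs://namenode/user/flume")')
current = master.config_for("web01")
unknown = master.config_for("not-registered")
```

The events themselves would still travel from node to node; only the `source` and `sink` descriptions pass through the master.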

Flume Goals: Reliability

Tunable Failure Recovery Modes

● Best Effort

● Store on Failure and Retry

● End to End Reliability
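
The difference between the three modes is easiest to see when the downstream link fails. The sketch below is a simplified Python illustration, assuming a hypothetical `send` callable that returns `True` once the downstream side acknowledges an event; it is not how Flume implements these modes internally.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

Send = Callable[[str], bool]  # assumption: True means downstream acknowledged


def best_effort(event: str, send: Send) -> None:
    """Best effort: try once; the event is lost if the send fails."""
    send(event)


class StoreOnFailure:
    """Store on failure and retry: buffer an event only when its send fails."""

    def __init__(self, send: Send) -> None:
        self.send = send
        self.pending: Deque[str] = deque()

    def append(self, event: str) -> None:
        if not self.send(event):
            self.pending.append(event)

    def retry(self) -> None:
        for _ in range(len(self.pending)):
            event = self.pending.popleft()
            if not self.send(event):
                self.pending.append(event)


class EndToEnd:
    """End to end: log every event first, drop it from the log once acknowledged."""

    def __init__(self, send: Send) -> None:
        self.send = send
        self.log: Dict[int, str] = {}
        self.next_id = 0

    def append(self, event: str) -> None:
        event_id = self.next_id
        self.next_id += 1
        self.log[event_id] = event  # recorded before any send attempt
        if self.send(event):
            del self.log[event_id]

    def retry(self) -> None:
        for event_id, event in list(self.log.items()):
            if self.send(event):
                del self.log[event_id]


# A hypothetical downstream that is unreachable at first, then recovers.
link = {"up": False}
delivered: List[str] = []


def flaky_send(event: str) -> bool:
    if link["up"]:
        delivered.append(event)
    return link["up"]


best_effort("a", flaky_send)   # link is down: "a" is lost
dfo = StoreOnFailure(flaky_send)
dfo.append("b")                # link is down: "b" is buffered
e2e = EndToEnd(flaky_send)
e2e.append("c")                # link is down: "c" stays in the log
link["up"] = True
dfo.retry()
e2e.retry()
```

The trade-off is cost against safety: best effort does no extra work, store on failure pays only when something breaks, and end to end pays for the log on every event.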

Flume Goals: Scalability

Horizontally Scalable Data Path

Load Balancing
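
The slide does not say how agents are spread over collectors, so the following Python sketch simply assumes a round-robin assignment to show why adding a collector lowers the load on each one. The function `assign_agents` and the agent and collector names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def assign_agents(agents: List[str], collectors: List[str]) -> Dict[str, str]:
    """Assign each agent to a collector in round-robin order (an assumption)."""
    if not collectors:
        raise ValueError("at least one collector is required")
    return {
        agent: collectors[i % len(collectors)]
        for i, agent in enumerate(agents)
    }


agents = [f"agent{i}" for i in range(6)]

# Two collectors, then a third one added to scale the data path out.
two = assign_agents(agents, ["collectorA", "collectorB"])
three = assign_agents(agents, ["collectorA", "collectorB", "collectorC"])

load_two = Counter(two.values())
load_three = Counter(three.values())
```

With six agents, each of two collectors serves three agents, and each of three collectors serves two.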

Flume Goals: Scalability

Horizontally Scalable Control Path

Flume Goals: Extensibility

● Simple Source and Sink API
  ○ Event streaming and composition of simple operations

● Plug-in Architecture
  ○ Add your own sources, sinks, decorators

Flume Goals: Manageability

Centralized Data Flow Management Interface

Flume Goals: Manageability

Configuring Flume

Node: tail("file") | filter [ console, roll(1000) { dfs("hdfs://namenode/user/flume") } ];
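
One way to read the configuration line above is as a composition of small parts: a source, a filter, a fan-out to two sinks, and a roll wrapper around the second sink. The Python sketch below models that reading. It rests on assumptions: that `[ a, b ]` sends every event to both sinks, that `roll(1000) { ... }` opens a fresh sink every 1000 milliseconds, and that `filter` stands for some predicate. All class names are hypothetical, and a fake clock replaces real time so the behaviour is repeatable.

```python
from typing import Callable, Iterable, List


class ListSink:
    """Stand-in for a real sink such as the console or a DFS file."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def append(self, event: str) -> None:
        self.events.append(event)


class FanOut:
    """Models `[ a, b ]`: every event goes to each wrapped sink."""

    def __init__(self, *sinks) -> None:
        self.sinks = sinks

    def append(self, event: str) -> None:
        for sink in self.sinks:
            sink.append(event)


class Roll:
    """Models `roll(n) { sink }`: opens a fresh sink once n ms have passed."""

    def __init__(self, interval_ms: int,
                 open_sink: Callable[[], ListSink],
                 clock_ms: Callable[[], int]) -> None:
        self.interval_ms = interval_ms
        self.open_sink = open_sink
        self.clock_ms = clock_ms
        self.opened: List[ListSink] = []  # every sink opened so far
        self._opened_at = 0

    def append(self, event: str) -> None:
        now = self.clock_ms()
        if not self.opened or now - self._opened_at >= self.interval_ms:
            self.opened.append(self.open_sink())
            self._opened_at = now
        self.opened[-1].append(event)


def run(source: Iterable[str], keep: Callable[[str], bool], sink) -> None:
    """Models `source | filter sink`: pass kept events on to the sink."""
    for event in source:
        if keep(event):
            sink.append(event)


ticks = iter([0, 400, 1000, 1500])  # fake clock readings in ms, one per kept event
console = ListSink()
rolled = Roll(1000, ListSink, lambda: next(ticks))
lines = ["INFO a", "DEBUG b", "INFO c", "WARN d", "ERROR e"]
run(lines, lambda e: not e.startswith("DEBUG"), FanOut(console, rolled))
```

Here the console receives all four kept events, while the rolled sink splits them into two outputs because the third event arrives 1000 ms after the first output was opened.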

Output Bucketing

/logs/web/2010/0715/1200/data-xxx.txt
/logs/web/2010/0715/1200/data-xxy.txt
/logs/web/2010/0715/1300/data-xxx.txt
/logs/web/2010/0715/1300/data-xxy.txt
/logs/web/2010/0715/1400/data-xxx.txt
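
The paths above follow a year, month-day, hour pattern. As a small illustration, the Python sketch below derives such a path from an event timestamp. The helper names `bucket_dir` and `bucket_path` are hypothetical, and the file suffix is passed in as-is because the slide does not say how the `xxx` / `xxy` part is generated.

```python
from datetime import datetime


def bucket_dir(ts: datetime, root: str = "/logs/web") -> str:
    """Directory for the hour that contains ts, e.g. <root>/2010/0715/1200."""
    return f"{root}/{ts.strftime('%Y/%m%d/%H00')}"


def bucket_path(ts: datetime, file_id: str, root: str = "/logs/web") -> str:
    """Full path of a data file inside the hourly bucket."""
    return f"{bucket_dir(ts, root)}/data-{file_id}.txt"


# Two events from the same hour land in the same directory.
first = bucket_path(datetime(2010, 7, 15, 12, 5), "xxx")
second = bucket_path(datetime(2010, 7, 15, 12, 59), "xxy")
later = bucket_path(datetime(2010, 7, 15, 13, 0), "xxx")
```

Bucketing by hour like this keeps each directory to a bounded time window, so a later job can process one hour of logs at a time.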

Use Case: Near Realtime Aggregator

Conclusion

Flume is
● A distributed data collection service
● Suitable for enterprise settings
● Designed for large amounts of log data

Questions to be unveiled?

Q&A
