Search Analytics with Flume & HBase
Otis Gospodnetić, Sematext International
Copyright 2010 Sematext Int'l. All rights reserved.
Agenda
Who I am
What
Why
How
Architecture Evolution
Role of Flume and HBase + Flume HBase Sink
Challenges
About Otis Gospodnetić
Lucene/Solr/Nutch/Mahout committer
Lucene in Action 1 & 2 co-author
Lucene Consulting since 2005
Sematext Int'l since 2007
About Sematext
Consulting, development, support for:
Big Data (Hadoop, HBase, Voldemort...)
Search (Lucene, Solr, Elastic Search...)
Web Crawling (Nutch)
Machine Learning (Mahout)
What We Built
Analytics for Search
Numerous reports (e.g. query volume, rate, latency, term frequencies / comparisons, hit buckets, search origins, etc.)
Trending over time
Comparisons of time periods
Top N reports
Various report filters
Report Example
Why We Built It
We need it
search-hadoop.com & search-lucene.com (subliminal msg: go use this site)
Search customers need it
Want to know what their visitors are searching for
Want to know how their search is behaving
How We Built It
JavaScript Beacons
Metric Capture Web App
Data Capture Mechanisms
Custom Log4J Appender
Flume Agents, Collectors, Sinks
HBase
MapReduce Aggregations
Search Analytics Reporting Web App
What's Flume
Distributed data/log collection service
Scalable, configurable, extensible
Centrally manageable, open source
Agents get data from app, Collectors save it
Abstractions: Source -> Decorator(s) -> Sink
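The three abstractions can be sketched in a few lines of Python. This is only an illustration of the idea, not Flume's API: the event shape and the names `list_source`, `add_host`, `ListSink` and `run_flow` are all hypothetical.

```python
from typing import Callable, Iterable, Iterator, List

Event = dict  # hypothetical event shape: {"body": str, ...}


def list_source(lines: Iterable[str]) -> Iterator[Event]:
    """Source: turns raw input lines into events."""
    for line in lines:
        yield {"body": line.rstrip("\n")}


def add_host(host: str) -> Callable[[Event], Event]:
    """Decorator factory: the returned function annotates each event."""
    def decorate(event: Event) -> Event:
        return {**event, "host": host}
    return decorate


class ListSink:
    """Sink: the final destination; here just an in-memory list."""

    def __init__(self) -> None:
        self.events: List[Event] = []

    def append(self, event: Event) -> None:
        self.events.append(event)


def run_flow(source, decorators, sink) -> None:
    """Pass every event from the source through each decorator, then to the sink."""
    for event in source:
        for decorate in decorators:
            event = decorate(event)
        sink.append(event)


sink = ListSink()
run_flow(list_source(["q=hbase\n", "q=flume\n"]), [add_host("web-1")], sink)
```

Swapping `ListSink` for a sink that writes to a database changes the destination without touching the source or the decorators, which is the point of the abstraction.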
What's HBase
Scalable, reliable, distributed, column-oriented DB
On top of HDFS
MapReducable
High Level Architecture
Architecture #1
Architecture #1 - Getting Messy
Arch #2 - HBaseLog4JAppender
HBaseLog4JAppender Cons
Doesn't help with reliable delivery
e.g. when network or HBase down
Non-centralized config with larger clusters
e.g. changing destination table in HBase
e.g. changing sampling rate
Architecture #3 - Flume OOTB
Arch #4 - Flume HBase Sink
FLUME-247 Flume HBase sink
Contributed by Sematext in September 2010
Reviewed, pending commit
Similar to FLUME-6 (basic example), but more flexible
https://issues.cloudera.org/browse/FLUME-247
Walk-Through
Start EC2 micro instance, configure logs-generation tool to simulate user actions
User actions start getting logged to a log file
Configure Flume Agent to "tail" the generated logs and send data to Flume Collector
Collector processes log messages and sends them to HBase's "raw logs" table
Later these logs are processed by the MapReduce job
Search Action -> Metric Capture -> Log File -> Flume Agent -> Flume Collector -> Decorators -> HBase Sink -> HBase
Decorator: processes Flume Collector log events and prepares them for HBase
HBase sink: FLUME-247
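A rough sketch of what "prepares them for HBase" could look like: parse one log line into a row key plus column values. The tab-separated log format, the `raw` column family and the function name are assumptions made for illustration; the slides do not show the actual log layout or the FLUME-247 configuration.

```python
from typing import Dict, Tuple


def prepare_for_hbase(line: str) -> Tuple[bytes, Dict[bytes, bytes]]:
    """Turn one log line into (row key, {column: value}).

    Assumed line format (hypothetical): timestamp_ms, query, hits and
    latency_ms separated by tabs.
    """
    ts_ms, query, hits, latency_ms = line.rstrip("\n").split("\t")
    # Zero-padded timestamp first, so keys sort in time order.
    row_key = f"{int(ts_ms):013d}|{query}".encode("utf-8")
    columns = {
        b"raw:query": query.encode("utf-8"),
        b"raw:hits": hits.encode("utf-8"),
        b"raw:latency_ms": latency_ms.encode("utf-8"),
    }
    return row_key, columns


row_key, columns = prepare_for_hbase("1286900000000\thbase sink\t42\t17\n")
```

The sink then only has to write `columns` under `row_key`; all knowledge of the log format stays in the decorator.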
Why Flume
Reliable delivery
e.g. queue msgs locally if destination unreachable
Easy, centralized management via Web UI or console
Good community, good progress
But: more complex, more moving parts
On Flume: slideshare.net/cloudera/inside-flume
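The "queue msgs locally if destination unreachable" idea, as a minimal sketch. This is not Flume code: `BufferingSink` and the fake `send` function are hypothetical, and the queue here lives in memory, whereas a durable setup would keep it on disk.

```python
from collections import deque
from typing import Callable, Deque, List


class BufferingSink:
    """Queues messages locally and forwards them in order when possible."""

    def __init__(self, send: Callable[[str], None]) -> None:
        self._send = send
        self._pending: Deque[str] = deque()

    def append(self, msg: str) -> None:
        self._pending.append(msg)
        self.flush()

    def flush(self) -> None:
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                return  # destination unreachable: keep messages queued
            self._pending.popleft()  # drop only after a successful send

    @property
    def pending(self) -> int:
        return len(self._pending)


# Fake destination that can be switched on and off.
state = {"up": False}
delivered: List[str] = []


def send(msg: str) -> None:
    if not state["up"]:
        raise ConnectionError("destination down")
    delivered.append(msg)


sink = BufferingSink(send)
sink.append("a")
sink.append("b")          # both stay queued while the destination is down
queued_while_down = sink.pending
state["up"] = True
sink.append("c")          # triggers delivery of a, b, c in order
```

This is the behavior the HBaseLog4JAppender approach lacked: with the appender alone, messages logged while the network or HBase was down had nowhere to wait.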
Why HBase
Scalable raw search data storage
MapReduce data input
Scalable aggregate data storage
Fast scans for time ranges, fast key lookups
Easy storage and compute power expansion
Good looking roadmap, community, progress
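Why time-range scans can be fast: HBase keeps rows sorted by key, so a key that starts with a zero-padded timestamp turns a time range into one contiguous slice. The sketch below imitates that with a sorted Python list; the key layout is an assumption, not the schema used in the talk, and no HBase client is involved.

```python
from bisect import bisect_left
from typing import List


def row_key(ts_ms: int, query: str) -> bytes:
    """Hypothetical key layout: zero-padded timestamp, then the query."""
    return f"{ts_ms:013d}|{query}".encode("utf-8")


def scan(sorted_keys: List[bytes], start_ms: int, stop_ms: int) -> List[bytes]:
    """Return keys with start_ms <= timestamp < stop_ms via two binary searches."""
    start = f"{start_ms:013d}".encode("utf-8")
    stop = f"{stop_ms:013d}".encode("utf-8")
    lo = bisect_left(sorted_keys, start)
    hi = bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


keys = sorted([
    row_key(1000, "hbase"),
    row_key(2000, "flume"),
    row_key(3000, "solr"),
])
hits = scan(keys, 1500, 3000)
```

One known trade-off of timestamp-first keys is that all new writes land at the end of the key space, so real schemas often add a prefix to spread the load.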
Challenges
HBase in a box is like dynamic equilibrium, or virtual reality, or jumbo shrimp
search-hadoop.com/m/p68C12nb7Hn
Data size. Solutions:
Compression (4-5x smaller with LZO)
Data pruning (variable levels)
Lots of data to process, update, aggregate
Query string distribution: very long-tail
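One possible form of pruning for a long-tail query distribution, sketched under assumptions: count normalized queries, keep those seen at least `min_count` times, and fold the rest into a single bucket. The threshold, the normalization and the `(other)` bucket name are illustrative choices, not the method described in the talk.

```python
from collections import Counter
from typing import Dict, Iterable


def prune_long_tail(queries: Iterable[str], min_count: int) -> Dict[str, int]:
    """Aggregate query counts, folding rare queries into one '(other)' bucket."""
    counts = Counter(q.strip().lower() for q in queries)
    pruned: Dict[str, int] = {}
    other = 0
    for query, n in counts.items():
        if n >= min_count:
            pruned[query] = n
        else:
            other += n  # long-tail query: keep its volume, drop its text
    if other:
        pruned["(other)"] = other
    return pruned


report = prune_long_tail(
    ["hbase", "HBase", "flume", "hbase sink", "flume"], min_count=2
)
```

Total query volume is preserved, so volume and rate reports stay correct while the number of stored rows shrinks; raising `min_count` gives a coarser pruning level.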
Work @ Sematext
We are hiring world-wide!
Search & Data Analytics
Machine Learning & NLP
Biiig Data
Contact
sematext.com
blog.sematext.com
@sematext
@otisg
[email protected]