Agenda
1) User-Group Introduction
2) Problem Statement
3) Log Data Analysis System Overview
4) Task Analysis
5) Solution Architecture
6) Trade-off Analysis
7) Automation
8) Performance Testing
9) Outcome & Plans
Demo Lab: Why did we start this project?
1) Increase Internal Experience
2) Create Reference Solution w/o NDA Limitations
3) Get Playground for Tests
4) Provide Demo Environment for Customers (using their data)
5) Decrease Time to Market (by introducing automation)
Log Data Analysis Platform Details
Key Facts:
• ~270-300 Web Servers
• Log Types: HTTPD Access logs, Error logs, Application Server Servlet, OS Service Logs
• ~500K events per minute
• 150GB of data per day
Technologies:
• Flume
• Hadoop/HDFS, MapReduce
• Hive, Impala
• Oozie
• Elasticsearch, Kibana 3
• Tableau Analytics platform
• Puppet + Vagrant
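As a rough sanity check, the key facts above can be turned into per-second and per-event figures. This is a back-of-envelope sketch only: it assumes the event rate is sustained around the clock, that "GB" means decimal gigabytes, and that both numbers describe the same workload. None of these assumptions is stated on the slide.

```python
# Back-of-envelope figures derived from the key facts on the slide.
EVENTS_PER_MINUTE = 500_000      # "~500K events per minute"
BYTES_PER_DAY = 150 * 10**9      # "150GB of data per day" (assumed decimal GB)

events_per_second = EVENTS_PER_MINUTE / 60
events_per_day = EVENTS_PER_MINUTE * 60 * 24   # assumes a constant rate all day
avg_event_bytes = BYTES_PER_DAY / events_per_day

print(f"{events_per_second:,.0f} events/second")
print(f"{events_per_day:,} events/day")
print(f"{avg_event_bytes:.0f} bytes/event on average")
```

Under these assumptions the load is a little over 8,000 events per second, with an average event size of roughly 200 bytes.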
Log Data Examples
Access log:
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
Error log:
[Sun Mar 7 20:58:27 2004] [info] [client 64.242.88.10] (104)Connection reset by peer: client stopped connection before send body completed
[Sun Mar 7 21:16:17 2004] [error] [client 24.70.56.49] File does not exist: /home/httpd/twiki/view/Main/WebHome
Vmstat:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff   cache   si   so   bi   bo   in   cs us sy id wa st
 0  0 305416 260688  29160 2356920    2    2    4    1    0    0  6  1 92  2  0
iostat:
Linux 2.6.32-100.28.5.el6.x86_64 (dev-db) 07/09/2011
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.68    0.00    0.52    2.03    0.00   91.76
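To illustrate what turning such a line into a structured event might look like, here is a minimal Python sketch that parses the access log sample shown above. The regular expression and the `parse_access_line` helper are hypothetical illustrations, not the parsing used in the platform (the slides do not describe it), and they cover only the Common Log Format fields visible in the sample.

```python
import re
from datetime import datetime

# Common Log Format: host ident user [time] "request" status size
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_access_line(line):
    """Return a dict of typed fields, or None if the line does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    event = match.groupdict()
    event["time"] = datetime.strptime(event["time"], "%d/%b/%Y:%H:%M:%S %z")
    event["status"] = int(event["status"])
    # A "-" size means no body was returned; treat it as zero bytes.
    event["size"] = 0 if event["size"] == "-" else int(event["size"])
    return event

sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
event = parse_access_line(sample)
print(event["host"], event["user"], event["method"], event["path"], event["status"], event["size"])
```

Returning `None` for non-matching lines lets a caller count or route malformed records instead of failing on them.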
Solution Architecture
[Diagram: Batch Layer, Serving Layer and Speed Layer. Components: Raw Data Storage, Data Stream, Precomputing, Static Batch Views, Ad-hoc Batch Views, Static Views, Real-time Views, Corporate BI Tool. Legend: layer boundary; data flow (with direction indicated); query flow.]
[Diagram: Apache HTTP Servers, Data Stream, Raw Data Storage, Pre-computing, Batch Views, Real-Time Processing and Aggregations, Real-Time Views, Dashboard/Search, BI Tool.]
Avro as a Raw Data Storage file format
Parquet as a Batch Views file format
Star schema as a Batch Views data model
Automation (saves time and money)
• 80%: Development and Debugging (Local Development)
• 20%: F&P Testing, Demo (Cloud Development)
Automation Process
Phase | Tool | Notes
VM Provisioning | Vagrant | Supports: VirtualBox, VMware ESX, Amazon AWS
VM Bootstrapping | Puppet | Installs Cloudera Manager, Cloudera Distribution Hadoop, Elasticsearch + Kibana, Flume, MicroStrategy, Log Generator; creates the cluster using the Cloudera Manager API
Configure ETL and BI | Puppet | Configures Flume, Oozie, Elasticsearch, Impala, Hive, MicroStrategy Dashboards
Integration Tests | Puppet | Generates workload and ensures data goes through; checks logs for errors; calculates timing/throughput
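The three integration-test checks (data goes through, logs are clean, timing/throughput is measured) can be sketched in a few lines of Python. This is an illustration of the idea only: in the project these steps are driven by Puppet, and the function names, the error markers, and the input values below are all hypothetical.

```python
def count_error_lines(log_lines, markers=("ERROR", "Exception")):
    """Count log lines containing any of the (assumed) error markers."""
    return sum(1 for line in log_lines if any(m in line for m in markers))

def throughput(events_delivered, elapsed_seconds):
    """Events per second over the measured interval."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return events_delivered / elapsed_seconds

def check_run(events_sent, events_delivered, elapsed_seconds, log_lines):
    """Summarize one test run: delivery, error count, and throughput."""
    return {
        "all_delivered": events_delivered == events_sent,
        "error_lines": count_error_lines(log_lines),
        "events_per_second": throughput(events_delivered, elapsed_seconds),
    }

# Made-up inputs purely to show the call shape.
result = check_run(
    events_sent=1000,
    events_delivered=1000,
    elapsed_seconds=4.0,
    log_lines=["INFO agent started", "ERROR sink timeout"],
)
print(result)
```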
Log Generator
1 thread can generate:
• 4200 events / second (File source)
• 5500 events / second (TCP source)
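These per-thread rates can be compared against the production load of ~500K events per minute quoted in the key facts. The sketch below estimates how many generator threads would be needed to reproduce that rate; it assumes throughput scales linearly with the number of threads, which the slides do not claim, so treat the result as a lower bound.

```python
import math

# Target rate from the key facts: ~500K events per minute.
TARGET_EVENTS_PER_SECOND = 500_000 / 60

# Measured single-thread generator rates from the slide.
PER_THREAD_RATE = {"file": 4200, "tcp": 5500}

def threads_needed(target_rate, per_thread_rate):
    """Smallest thread count whose combined rate reaches the target,
    assuming perfectly linear scaling (an optimistic assumption)."""
    return math.ceil(target_rate / per_thread_rate)

for source, rate in PER_THREAD_RATE.items():
    print(source, threads_needed(TARGET_EVENTS_PER_SECOND, rate))
```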
Outcome
1) Demo lab, playground, testing platform (in 1 hour)
2) Sizing Calculator
3) Helped to get 3 new customers (one is really, really huge)
4) Strategic Partnership with Cloudera
5) Tons of experience and fun
Plans
1) Add support for other Hadoop Distributions (Hortonworks, MapR)
2) Make Project Open-Source
Thank You!
SoftServe US Office
One Congress Plaza, 111 Congress Avenue, Suite 2700
Austin, TX 78701
Tel: 512.516.8880
Contacts
Valentyn [email protected]
Tel: 866.687.3588 x4341