DESCRIPTION
• The evolution of Big Data, both inside Akamai and in the industry.
• The current Big Data Ecosystem with real-world examples.
• Challenges in Big Data and future directions.
Real… Big… Data… and its constant evolution
Scott MacGregor
Who is this guy?
Akamai Big Data Infrastructure
• 150,000 collector nodes
• 5,000 map/reduce nodes
• Billions of jobs per day
What is Big Data?
The V’s
Data that is Big
From Hortonworks
What’s it really about?
From the beginning…
• Akamai needed a billing system and scalable monitoring
• The Open Source community wanted a search engine
• Yahoo needed better product analytics for page views
• Google needed more scalable computation for ad management
• Facebook needed real-time updates to the social graph
• LinkedIn needed a real-time activity data pipeline
• Twitter needed hashtag and topic streams
• Amazon needed durable shopping carts
• Netflix needed a recommendation engine
Big Data timeline
The timeline runs from 1998 to 2014 (milestone years: 1998, 2001, 2003, 2005, 2006, 2007, 2008, 2010, 2011, 2012, 2013, 2014).

Akamai:
• Generalized map/reduce on 1 machine
• Decentralized job scheduling across multiple machines; file system DB
• Wide-area, real-time, in-memory system monitoring
• Geographical redundancy
• Real-time reporting; columnar DB
• Distributed file system DB
• Wide-area MapReduce; ExaByte Query

Industry:
• Google MapReduce, Google FS
• Nutch; Yahoo spins off Hadoop
• Amazon Dynamo
• NoSQL
• HBASE, Neo4j
• Facebook Cassandra, LinkedIn Kafka
• Twitter Storm, Facebook Presto
How it works…
Big Data modes
• Batch
  – Computation over a large static data set
  – Results are complete
• Online
  – Computation on data as it's generated
  – Localized results, must be aggregated downstream
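A toy sketch of the contrast (data values invented for illustration): the batch job sees the complete static data set at once, while the online path computes localized partial results per collector node that must be aggregated downstream.

```python
# Static data set, and the same data as seen by two collector nodes
data = [(1423, 100), (7, 40), (1423, 250), (1423, 10)]
node_a, node_b = data[:2], data[2:]

def total(records, key):
    """Sum the values recorded for one key."""
    return sum(v for k, v in records if k == key)

# Batch: one computation over the complete, static data set
batch = total(data, 1423)

# Online: each node computes a localized partial result as data
# arrives; the partials must be aggregated downstream
partials = [total(node_a, 1423), total(node_b, 1423)]
downstream = sum(partials)

print(batch, partials, downstream)   # -> 360 [100, 260] 360
```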
Big Data primitives
• Collection
• Parsing
• Partitioning
• Filtering
• Throttling
• Aggregation
• Tracking
• Validation
• Analysis
Collection
• What
  – Logs
  – Metadata
  – System stats
  – Application events
  – Application stats
  – Network data
• How
  – Email
  – SPDY
  – HTTP POST
  – SCP
  – Scribe
  – Avro
  – Custom
Parsing
• Read lines or blocks and split into fields
• Transform, e.g. protobuf
• Map keys to values
S 1359487051.701 4200191766 2097152 2097918 61.252.169.21 GET 440 1423 itunesus011.download.akamai.com 200 - iPeV image/jpeg - - 44 3031 - - - - - W - us/r1000/011/Purple/53/e6/0f/mzl.slohufby.320x480-75.jpg - a440.phobos.apple.com
1359486900 1423 a440.phobos.apple.com 1 3158
1359486900 1423 200 1 30128
1359486900 1423 1 209158
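A sketch of the split-and-map step applied to the sample line above (the field positions and key names are assumptions for illustration, not Akamai's actual log schema):

```python
def parse_line(line: str) -> dict:
    """Split a raw log line on whitespace and map selected
    positional fields to named keys (positions are illustrative)."""
    f = line.split()
    return {
        "timestamp": float(f[1]),
        "bytes": int(f[3]),
        "client_ip": f[5],
        "method": f[6],
        "id": f[8],          # assumed meaning: the 1423 seen in the derived records
        "host": f[9],
        "status": int(f[10]),
    }

raw = ("S 1359487051.701 4200191766 2097152 2097918 61.252.169.21 "
       "GET 440 1423 itunesus011.download.akamai.com 200")
record = parse_line(raw)
print(record["host"], record["status"])   # itunesus011.download.akamai.com 200
```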
Partitioning
• Bucketing
  – Reduce to a single record per bucket
  – e.g. 5 minutes, /24, etc.
• Hashing
  – Bucket blocks or records of data by a hash function
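A minimal sketch of both schemes (the bucket width and partition count are arbitrary choices for illustration). Note that the timestamp 1359487051.701 from the sample log line rounds down to the 1359486900 bucket seen in the parsed records.

```python
import hashlib

def time_bucket(ts: float, width: int = 300) -> int:
    """Round a timestamp down to its 5-minute (300 s) bucket boundary."""
    return int(ts) - int(ts) % width

def hash_partition(key: str, n_partitions: int) -> int:
    """Stably assign a record key to one of n partitions via a hash."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_partitions

print(time_bucket(1359487051.701))   # -> 1359486900
print(hash_partition("1423", 16))
```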
Filtering
• Statistical Methods
  – Top-k (Hierarchical CountSketch)
  – Set membership (Bloom filters)
  – Cardinality counting (HyperLogLog)
  – Frequency estimates (CountSketch)
  – Change detection (Deltoid)
• Sampling
  – Random
  – Reservoir
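Of the sampling methods listed, reservoir sampling is the one built for streams of unknown length; a standard Algorithm R sketch:

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream
    whose length is not known in advance (Algorithm R)."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir
        else:
            j = random.randint(0, i)     # inclusive of i
            if j < k:
                sample[j] = item         # replace with probability k/(i+1)
    return sample

random.seed(7)
sample = reservoir_sample(range(1000), 5)
print(sample)
```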
Throttling
• Limit on cardinality per partition
  – Requires central management
  – Drop records over max
• Remove or trim large fields

Before:
S 1359487051.701 4200191766 2097152 2097918 61.252.169.21 GET 440 1423 itunesus011.download.akamai.com 200 - iPeV image/jpeg - - 44 3031 - - - - - W - us/r1000/011/Purple/53/e6/0f/mzl.slohufby.320x480-75.jpg - a440.phobos.apple.com

After:
S 1359487051.701 4200191766 2097152 2097918 61.252.169.21 GET 440 1423 ~ 200 - iPeV image/jpeg - - 44 3031 - - - - - W - ~
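A per-process sketch of the cardinality cap (a real deployment needs the central management noted above; the cap here is tiny for illustration):

```python
from collections import defaultdict

MAX_KEYS = 2   # per-partition cardinality cap (tiny, for illustration)

def throttle(records):
    """Admit a record if its key is already known for its partition
    or the partition is still under the key cap; otherwise drop it."""
    seen = defaultdict(set)
    for partition, key, value in records:
        keys = seen[partition]
        if key in keys or len(keys) < MAX_KEYS:
            keys.add(key)
            yield partition, key, value
        # else: record dropped, cardinality over max

records = [(0, "a", 1), (0, "b", 2), (0, "c", 3), (0, "a", 4)]
kept = list(throttle(records))
print(kept)   # -> [(0, 'a', 1), (0, 'b', 2), (0, 'a', 4)]
```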
Aggregation
• Merge
  – Merge-sort blocks in a partition
• Reduce
  – Combine values for like keys
  – Sum, Min, Max, Mask, etc.
• Shuffle
  – Move the data to where it's needed or closer to like data
Input records:
1359486900 1423 1 209158
1359486900 1423 1 209158
1359529800 1423 1 209158

Aggregate (combine values for like keys):
1359486900 1423 2 418316
1359529800 1423 1 209158

Shuffle (route by key):
{1423, 1359486900} → 2 418316
{1423, 1359529800} → 1 209158
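The aggregate step can be sketched as a reduce that sums the count and byte columns for records sharing a key (the column meanings are assumptions read off the sample records):

```python
from collections import defaultdict

def reduce_records(records):
    """Combine values for like keys: sum counts and bytes for
    records sharing the same (time bucket, id) key."""
    totals = defaultdict(lambda: [0, 0])
    for bucket, rid, count, nbytes in records:
        totals[(bucket, rid)][0] += count
        totals[(bucket, rid)][1] += nbytes
    return {k: tuple(v) for k, v in totals.items()}

records = [
    (1359486900, 1423, 1, 209158),
    (1359486900, 1423, 1, 209158),
    (1359529800, 1423, 1, 209158),
]
totals = reduce_records(records)
print(totals)
# -> {(1359486900, 1423): (2, 418316), (1359529800, 1423): (1, 209158)}
```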
Tracking
• Tracking
  – Embed a GUID in each data unit sent
  – Publish GUIDs independently of the data flow
  – Completeness is expected (published GUIDs) vs. actual (embedded GUIDs)
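A sketch of that completeness check (function and variable names are assumed): GUIDs published over the control channel are compared with the GUIDs actually embedded in the data that arrived.

```python
import uuid

def completeness(published, received):
    """Expected (published GUIDs) vs. actual (embedded GUIDs):
    return the fraction received and the set of missing GUIDs."""
    missing = published - received
    return 1 - len(missing) / len(published), missing

published = {uuid.uuid4() for _ in range(4)}
lost = next(iter(published))
received = published - {lost}            # one data unit never arrived
ratio, missing = completeness(published, received)
print(f"{ratio:.0%} complete, missing {len(missing)} unit(s)")
```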
Data integrity
• Watermark
  – Producer watermarks every n lines with a crypto key
  – Receiver checks watermarks
• Checksum
  – Block checksums
  – Line CRC
  – Etc.
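Of the checksum options, a line CRC is easy to sketch with the standard library's CRC32 (the tab-separated framing here is an assumption):

```python
import zlib

def append_crc(line: str) -> str:
    """Tag a line with its CRC32 so corruption is detectable."""
    return f"{line}\t{zlib.crc32(line.encode()):08x}"

def check_crc(tagged: str) -> bool:
    """Recompute the CRC and compare it against the stored tag."""
    line, _, crc = tagged.rpartition("\t")
    return zlib.crc32(line.encode()) == int(crc, 16)

good = append_crc("1359486900 1423 1 209158")
bad = good.replace(" 1423 ", " 1424 ", 1)   # corrupt a field, keep old CRC
print(check_crc(good), check_crc(bad))      # -> True False
```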
Analysis
• Online
  – Precomputed reports
• Batch
  – Spark programs
  – Map/Reduce
  – Hive: HQL
  – SQL
Big Data at Akamai
• Billing and Reporting
• System monitoring
• Media Analytics
• Security
• Log archive
Billing and reporting
Data flow: logs from the Akamai Edge networks and products enter a queue, are parsed into pipelines, then shuffled, split, and aggregated into the Billing DB and reporting systems (customer reporting and internal apps).

Parsing:
• splits lines into fields
• maps keys to values per pipeline
• each log generates many pipelines
• each pipeline represents a streaming table

Scale: 3 PB/day, doubling every year.

Evolution:
• Logs were emailed (up to 1 PB/day)
• Now delivered via SPDY (3 PB/day)
System monitoring
Data flow: Akamai networks and products feed parsers, which feed aggregators (Agg), which feed a top level aggregator (TLA) serving client SQL queries, alerting, and trending.

TLA: the top level aggregator pulls data from aggregators, which pull data from producers at the time of the request. Producers rewrite data locally.

Scale: 50M jobs/day.

Evolution: single-machine memory for table joins today; distributed memory for table joins in the future.
Media analytics
Data flow: events from Akamai products enter pipelines, then a front end, then a column store with an index that drives reporting through an API / UI for customers.

• Indexes are recreated for each update
• Supports insert and update
• Reads are flexible and fast

Evolution:
• The index is now a fingerprint, to lower cost
• HyperLogLog for uniqueness counting
Security products
Data flow: events from the Akamai Edge networks and the Akamai Web Firewall enter pipelines and a front end, then land in HDFS. Map/Reduce, HBASE, Hive (on Cloudera), and Graphite process the data, joined with external data feeds, to serve the Operations Center, reputation scoring, threat analysis, intelligence reports, risk-based authentication, and payment fraud detection.

Scale: 20 TB/day.

Evolution:
• Replacing HBASE with a custom aggregator
• Replacing Hive with a custom SQL processor
Log archive
Data flow: logs enter a queue, are parsed into pipelines, and land in the archive, which maintains an archive index (10 TB), a log cache holding about 10% of the data, and a client IP (CIP) sketch. A client request goes to the archive front end, which gets the index and/or CIP sketch, then reads the cache first, then the archive. Spark and Spark SQL run over HDFS (Hadoop / YARN) for analysis.

Scale: 180 PB, 450 trillion records, doubling every year. The archive spans 90 data centers distributed over a wide area, projected to reach 1.2 EB in 3 years.

Evolution: the index was flat files, now HDFS/Spark.
The Ecosystem
• Script: Pig
• SQL: Hive
• NoSQL: HBASE
• Stream: Kafka, Storm
• Search: Solr
• In-Mem: Spark
• Integration: Flume, Avro
• Operations: Ambari, Zookeeper, Oozie
• Monitoring: Graphite
• Sharing: Mesos
• Base: HDFS, Hadoop / YARN
Building a system
If you need fast access to massive amounts of data where queries are constrained to an index (read optimized): • Start with HDFS or Cassandra • Add HBASE column store • Add Hive for SQL-like access • Add Pig for scripting
Example access patterns: HBASE (Get, Put), Hive (SELECT *), Pig scripts.
Building a system
If you need to search logs: • Start with HDFS • Add Flume for log data integration • Add Avro for data serialization • Add Solr for search
Example: Flume agents (Avro sink) send logs to a Flume collector (Avro source), which writes to HDFS (Hadoop / YARN); Solr serves searches, e.g. ip = 1.1.1.1.
Building a system
If you need flexible and shared access to unlimited amounts of data: • Start with HDFS or Cassandra • Add Hadoop for Map/Reduce or • Add Hive for SQL-like access or • Add Pig for scripting • Add Mesos for resource sharing • Add Ambari for cluster management and provisioning • Add map/reduce programs for business logic
Example access patterns: Pig scripts, Hive SELECT, and Java map/reduce programs, with Mesos sharing resources and Ambari managing the cluster.
Building a system
If you need fast, flexible access to in-memory data: • Start with HDFS • Add Spark • Add Spark SQL for SQL-like access or • Create Spark programs for other business logic
Stack: HDFS (Hadoop / YARN) → Spark → Spark SQL (SELECT …) and Spark programs (Java).
Building a system
If you need real-time stream event processing: • Start with HDFS • Add Kafka for messaging and pub/sub • Add Storm for event processing • Develop Java Bolts for processing logic
Stack: HDFS (Hadoop / YARN) → Kafka → Storm with Java bolts.
Future at Akamai
• 100x
  – Everything bigger and faster
  – Requires new R&D across many Big Data components
• Scaling the Big Data ecosystem across the wide area
• Internet security
  – Positive reputation scoring
  – Automatic DDoS mitigation
• Low-latency data collection
  – 2^53 unique keys, <1 minute latency
• Support DevOps
  – Near real-time monitoring and control
Thank You