The New Standard for Mobile KPI Analytics
Fluentd + Google BigQuery
Kazunori Sato, Developer Advocate, Cloud Platform Team
#gcpライブ
+Kazunori Sato
@kazunori_279
Developer Advocate, Cloud Platform, Google Inc
- GCP developer community support
- GCP product launch support
Agenda
Big Data in Google and Google BigQuery
Why is BigQuery so fast?
Real-time Streaming Import by Fluentd + BigQuery
Real-time KPI analytics by Lambda Architecture
Big Data in Google and Google BigQuery
100 hours/min
100 petabytes
500+ million users
900+ million devices
Big Data in Google
Cloud Technology Innovations
[Timeline figure, 2003 to 2013. Years marked: 2003, 2004, 2006, 2007, 2010, 2011, 2012, 2013. Technologies shown: GFS, MapReduce, Big Table, Paxos impl., Colossus, Dremel, Spanner/F1, Omega, Cloud Storage, Cloud Datastore, BigQuery.]
At Google, we have “big” big data everywhere
What if a Googler is asked: “Can you give me the list of top 20 Android apps installed in 2012?”
At Google, we don’t use MapReduce for this
We use Dremel
Google BigQuery
... FROM installlog.2012 ORDER BY count DESC
It scans 100B rows in ~30 sec. No index used.
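The question above boils down to a full scan, a count per app, and a top-N cut. Here is a minimal Python sketch of that logic on a tiny in-memory log. The column name `app_id` and the sample rows are hypothetical and do not come from the slides. The sketch only illustrates the shape of the computation, not how Dremel executes it.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_apps(install_log: Iterable[dict], n: int = 20) -> List[Tuple[str, int]]:
    """Scan every row once, count installs per app, keep the n largest."""
    # "app_id" is a hypothetical column name used for illustration only.
    counts = Counter(row["app_id"] for row in install_log)
    return counts.most_common(n)


# Hypothetical sample data standing in for the install log.
sample_log = [
    {"app_id": "maps"}, {"app_id": "mail"}, {"app_id": "maps"},
    {"app_id": "chat"}, {"app_id": "maps"}, {"app_id": "mail"},
]

print(top_apps(sample_log, n=2))  # [('maps', 3), ('mail', 2)]
```

No index is built or consulted: every row is read exactly once, which is the access pattern the slide describes.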
Google BigQuery: Massively Parallel Query Service
Cost of BigQuery
Storage: $0.020 per GB per month
Queries: $5 per TB
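As a rough worked example of the prices quoted on this slide ($0.020 per GB per month for storage, $5 per TB for queries), the sketch below estimates a monthly bill. The workload figures (1,000 GB stored, 10 TB scanned) are made up for illustration, and the function ignores any free tier or later price changes.

```python
STORAGE_USD_PER_GB_MONTH = 0.020  # storage price quoted on the slide
QUERY_USD_PER_TB = 5.0            # query price quoted on the slide


def monthly_cost_usd(stored_gb: float, scanned_tb: float) -> float:
    """Estimated monthly cost: storage charge plus query (bytes scanned) charge."""
    return stored_gb * STORAGE_USD_PER_GB_MONTH + scanned_tb * QUERY_USD_PER_TB


# Hypothetical workload: 1,000 GB stored, 10 TB scanned by queries in a month.
estimate = monthly_cost_usd(stored_gb=1000, scanned_tb=10)
print(f"${estimate:.2f} per month")  # $70.00 per month
```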
Applications
- Gaming, Social, Mobile
- Ads, Digital Marketing, DMP, Media
- Monitoring, Alerting and Security
- Retail
- Internet of Things (IoT)
BigQuery Analytic Service in the Cloud
Import, Analyze and Export
[Diagram elements: BigQuery; R and Pandas; Microsoft Excel; Google Spreadsheet; BI Tools; Hadoop/Hive; Spark; Adwords; DoubleClick; Google Analytics; Event Logs; Databases; IoT Devices. Arrows labeled Import, Analyze and Export.]
Why is BigQuery so fast?
Column Oriented Storage
Record Oriented Storage vs. Column Oriented Storage
Less bandwidth, More compression
select top(title), count(*)
from publicdata:samples.wikipedia
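To make the "less bandwidth" point concrete, here is a small Python sketch that stores the same rows in a record-oriented and a column-oriented layout and compares how many bytes a query touching only `title` has to read. The rows and the `contributor` and `comment` columns are hypothetical. Real ColumnIO encoding and compression are not modeled.

```python
from collections import Counter

# Hypothetical rows; only "title" is needed by the query on the slide.
rows = [
    {"title": "Tokyo", "contributor": "alice", "comment": "fixed a typo in the intro"},
    {"title": "Kyoto", "contributor": "bob", "comment": "added a section on temples"},
    {"title": "Tokyo", "contributor": "carol", "comment": "updated population figures"},
]

# Record-oriented layout: each row is stored as one contiguous record.
record_store = ["|".join(row.values()) for row in rows]

# Column-oriented layout: each column is stored contiguously on its own.
column_store = {name: [row[name] for row in rows] for name in rows[0]}


def bytes_scanned_record() -> int:
    """A record store must read whole records, even for a one-column query."""
    return sum(len(record.encode()) for record in record_store)


def bytes_scanned_column(column: str) -> int:
    """A column store reads only the values of the requested column."""
    return sum(len(value.encode()) for value in column_store[column])


top_title = Counter(column_store["title"]).most_common(1)
print(top_title)  # [('Tokyo', 2)]
print(bytes_scanned_column("title"), "<", bytes_scanned_record())
```

The gap widens as rows get wider, because the record layout pays for every column while the column layout pays only for the ones the query names.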
Massively Parallel Processing
Scanning 1 TB in 1 sec takes 5,000 disks
Each query runs on thousands of servers
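The 5,000-disk figure can be sanity-checked with simple arithmetic. The sketch below assumes a decimal terabyte (10^12 bytes) and perfectly even parallel reads, neither of which is stated on the slide.

```python
def per_disk_throughput_mb_s(total_bytes: float, seconds: float, disks: int) -> float:
    """Read rate each disk must sustain if the scan is spread evenly across disks."""
    return total_bytes / seconds / disks / 1e6


# Assumption: 1 TB = 1e12 bytes, all disks read in parallel for the full second.
print(per_disk_throughput_mb_s(1e12, 1, 5000))  # 200.0
```

Under these assumptions each disk has to deliver about 200 MB/s, which is why a one-second terabyte scan needs thousands of spindles working at once.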
Fast aggregation by tree structure
[Diagram: Mixer 0 at the root, two Mixer 1 nodes below it, four Shards at the leaves, reading ColumnIO on Colossus. Query fragments shown alongside the tree: SELECT state, year; COUNT(*) GROUP BY state; WHERE year >= 1980 and year < 1990; ORDER BY count_babies DESC LIMIT 10; COUNT(*) GROUP BY state.]
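The tree on this slide (Shards at the leaves, Mixer 1 in the middle, Mixer 0 at the root) can be sketched in plain Python. Which query fragment runs at which level is an assumption here: shards filter and count, intermediate mixers merge partial counts, and the root sorts and applies the limit. The sample rows are invented.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Row = Tuple[str, int]  # (state, year)


def shard_scan(rows: Iterable[Row], year_from: int, year_to: int) -> Counter:
    """Leaf: apply the WHERE filter and compute partial COUNT(*) per state."""
    return Counter(state for state, year in rows if year_from <= year < year_to)


def mixer_merge(partials: Iterable[Counter]) -> Counter:
    """Intermediate mixer: add up the partial counts from its children."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def root_mixer(partials: Iterable[Counter], limit: int) -> List[Tuple[str, int]]:
    """Root: merge, ORDER BY count DESC (ties broken by state name), LIMIT."""
    merged = mixer_merge(partials)
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


# Hypothetical data spread over four shards.
shards = [
    [("CA", 1985), ("TX", 1979), ("CA", 1981)],
    [("NY", 1988), ("CA", 1990)],
    [("TX", 1982), ("TX", 1989)],
    [("NY", 1980), ("WA", 1984)],
]

leaf_results = [shard_scan(s, 1980, 1990) for s in shards]
mixer1_results = [mixer_merge(leaf_results[:2]), mixer_merge(leaf_results[2:])]
print(root_mixer(mixer1_results, limit=10))
# [('CA', 2), ('NY', 2), ('TX', 2), ('WA', 1)]
```

Because counts are associative, each level only passes small partial aggregates upward instead of raw rows.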
Inside BQ: Big JOIN
Big JOIN: executed with shuffling
- Both tables can be > 8MB
- BQ shuffler doesn’t sort; just hash partitioning
From: Google BigQuery Analytics
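A shuffle join by hash partitioning, with no sorting, can be sketched as follows. This is a generic illustration of the technique named on the slide, not BigQuery's actual shuffler. The table contents, the `user_id` key and the partition count are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def hash_partition(rows: List[Dict], key: str, n: int) -> List[List[Dict]]:
    """Route each row to a partition by hashing its join key; no sorting involved."""
    partitions: List[List[Dict]] = [[] for _ in range(n)]
    for row in rows:
        partitions[hash(row[key]) % n].append(row)
    return partitions


def local_hash_join(left: List[Dict], right: List[Dict], key: str) -> List[Dict]:
    """Join two co-located partitions with an in-memory hash table."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def shuffle_join(left: List[Dict], right: List[Dict], key: str, n: int = 4) -> List[Dict]:
    """Partition both tables by the same hash, then join partition by partition."""
    joined: List[Dict] = []
    for lp, rp in zip(hash_partition(left, key, n), hash_partition(right, key, n)):
        joined.extend(local_hash_join(lp, rp, key))
    return joined


# Hypothetical tables.
users = [{"user_id": 1, "name": "ann"}, {"user_id": 2, "name": "ben"}, {"user_id": 3, "name": "cy"}]
purchases = [
    {"user_id": 1, "item": "sword"}, {"user_id": 3, "item": "shield"},
    {"user_id": 1, "item": "potion"}, {"user_id": 4, "item": "map"},
]

for row in sorted(shuffle_join(users, purchases, "user_id"), key=lambda r: (r["user_id"], r["item"])):
    print(row)
```

Rows with equal keys always land in the same partition, so each partition pair can be joined independently and in parallel, and neither table has to fit on a single worker.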
Real-time Streaming Import with Fluentd + BigQuery
“I want a real-time dashboard for collecting the votes and system stats from 200 servers”
BigQuery Streaming
- Low cost: $0.01 per 100,000 rows
- Real-time availability of data
- 100,000 rows per second per table
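The streaming figures on this slide lend themselves to a quick capacity and cost estimate. The sketch below reads the rate limit as 100,000 rows per second for each table, which is an interpretation of the slide. The 200-server, 50-rows-per-second workload is hypothetical.

```python
import math

STREAMING_USD_PER_100K_ROWS = 0.01  # price quoted on the slide
ROWS_PER_SEC_PER_TABLE = 100_000    # limit quoted on the slide, read as per table


def streaming_cost_usd(rows: int) -> float:
    """Cost of streaming the given number of rows."""
    return rows / 100_000 * STREAMING_USD_PER_100K_ROWS


def tables_needed(rows_per_sec: int) -> int:
    """Minimum number of tables to spread the load across under the per-table limit."""
    return math.ceil(rows_per_sec / ROWS_PER_SEC_PER_TABLE)


# Hypothetical workload: 200 servers, each emitting 50 rows per second.
rows_per_sec = 200 * 50
rows_per_day = rows_per_sec * 86_400

print(tables_needed(rows_per_sec))
print(f"${streaming_cost_usd(rows_per_day):.2f} per day")
```

The same functions show that a sustained 1,000,000 rows per second would need the load split across at least 10 tables.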
Slideshare uses Fluentd for collecting logs from >500 servers.
"We take full advantage of its extendable plugin architecture and use it as a message bus that collects data from hundreds of servers into multiple backend systems." Sylvain Kalache, Operations Engineer
Why Fluentd? Because it’s super easy to use, and has extensive plugins written by an active community.
Now Fluentd logs can be imported into BigQuery really easily, at ~1M rows/s
Search “fluentd bigquery” on GitHub
Google Spreadsheet
IoT Example: RasPi > BigQuery > Spreadsheet
Real-time KPI Analytics with Lambda Architecture
Lambda Architecture is:
A complementary pair of:
- in-memory real-time processing: fast, but small and volatile.
- large HDD/SSD batch processing: slow, but large and persistent.
Proposed by Nathan Marz
ex. Twitter Summingbird
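The complementary pair can be sketched as two views over the same event stream that are combined at query time. This is a deliberately simplified toy, assuming a single process and a batch job that always covers everything ingested so far. The class and method names are hypothetical.

```python
from collections import Counter
from typing import List


class LambdaCounter:
    """Toy Lambda Architecture: a persistent batch view plus a volatile speed view."""

    def __init__(self) -> None:
        self.master_log: List[str] = []  # append-only log (large, persistent layer)
        self.batch_view = Counter()      # recomputed from the full log, slow to refresh
        self.speed_view = Counter()      # in-memory counts since the last batch run

    def ingest(self, event_key: str) -> None:
        """Every event goes to the master log and to the real-time view."""
        self.master_log.append(event_key)
        self.speed_view[event_key] += 1

    def run_batch(self) -> None:
        """Recompute the batch view from scratch, then drop the covered recent counts."""
        self.batch_view = Counter(self.master_log)
        self.speed_view.clear()

    def query(self, event_key: str) -> int:
        """Answer = batch result + whatever arrived since the last batch run."""
        return self.batch_view[event_key] + self.speed_view[event_key]


kpi = LambdaCounter()
for _ in range(3):
    kpi.ingest("vote_a")
kpi.run_batch()       # batch layer catches up
kpi.ingest("vote_a")  # arrives after the batch run
print(kpi.query("vote_a"))  # 4
```

Losing the speed view only costs the most recent counts, because the batch view can always be rebuilt from the master log.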
Norikra: an open source stream processing tool
- Production use at LINE, the largest Asian SNS with 500M users, for massive log analysis
- Super easy to use: requires no heavyweight cluster setup
Real-time KPI analysis with SQL-based in-memory continuous query
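An in-memory continuous query can be pictured as a count that is kept up to date over a sliding time window. The Python sketch below shows that idea in general terms. It does not use Norikra or its query language, the class name and event keys are hypothetical, and it assumes events arrive in timestamp order.

```python
from collections import Counter, deque
from typing import Deque, Dict, Tuple


class SlidingWindowCount:
    """Continuously maintained COUNT(*) GROUP BY key over the last window_sec seconds."""

    def __init__(self, window_sec: float) -> None:
        self.window_sec = window_sec
        self.events: Deque[Tuple[float, str]] = deque()  # (timestamp, key), oldest first
        self.counts = Counter()

    def add(self, timestamp: float, key: str) -> None:
        """Register a new event and evict anything that fell out of the window."""
        self.events.append((timestamp, key))
        self.counts[key] += 1
        self._expire(timestamp)

    def _expire(self, now: float) -> None:
        while self.events and self.events[0][0] <= now - self.window_sec:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
            if self.counts[old_key] == 0:
                del self.counts[old_key]

    def result(self, now: float) -> Dict[str, int]:
        """Current answer of the continuous query at time `now`."""
        self._expire(now)
        return dict(self.counts)


# "How many first purchases in the last 10 minutes?" with a 600-second window.
window = SlidingWindowCount(window_sec=600)
window.add(0, "first_purchase")
window.add(300, "first_purchase")
window.add(700, "first_purchase")
print(window.result(now=700))   # {'first_purchase': 2}
print(window.result(now=1000))  # {'first_purchase': 1}
```

Only the events inside the window are held in memory, which keeps the state small at the price of losing it on restart.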
Proposed Solution: Lambda Architecture
1. Fluentd: event log collection from various event sources
2. Norikra: easy, scalable real-time stream processing
3. BigQuery: scalable query engine for large datasets
4. Google Spreadsheet: flexible dashboard with charts
5. Docker: repeatable deployment in 10 minutes
Applications
Real-time KPI Dashboard
● Gaming: How many new users have purchased their first item in the last 10 minutes?
● Media: How many people hit the vote button during the live TV program?
● Retail: What is the current total revenue of all stores nationwide?
● Ads: What is the conversion rate of impressions/clicks to purchase?
Real-time Monitoring and Alerting
● Correlate system resource usage with access/application logs
● Real-time DoS or cheating detection
● Send e-mail notification from Apps Script triggered by Norikra
Solution Benefits
- Easy real-time SQL-based KPI analytics at 1M+ rows/sec by Norikra
- Easy real-time streaming import at 1M+ rows/sec by BigQuery + Fluentd
- Real-time dashboard with Google Spreadsheet
- Deployable within 10 min with Docker
Search “lambda dashboard” on GitHub