Build DMP on top of GCP
VMFive - Randy Huang
Agenda
• Migrated Pipeline to GCP
• Cost Comparison
• Business Use Case
• Fluentd Demo
ELK + AWS EMR
Kinesis Lambda
Pros & Cons
• Pros:
• Well supported.
• Well documented.
• Easy to find references.
• Cons:
• High cost.
• Not open source.
• Scale has to be set up front.
Pipeline on GCP
Dataflow
BigQuery
Machine Learning
Data Visualization
Compute Engine
Global Load Balancing
Data Studio
The Products and Services logos may be used to accurately reference Google's technology and tools, for instance in architecture diagrams.
Batch
BI Analysis
Storage Cloud Storage
Processing Cloud Dataflow
Streaming
Time Series Streaming Cloud Pub/Sub
Storage BigQuery
Targeting Engines
Data Sources
Machine Learning Applications
API Backend Compute Engine
Spark MLlib Cloud Dataproc
App Engine
Transform Data
Hosted Models Cloud Machine Learning
Real-Time Prediction API
Device Related Cloud Pub/Sub
Behavior Related Cloud Pub/Sub
3rd Party Data Cloud Pub/Sub
Redis Compute Engine
Pros & Cons
• Pros:
• Cost-effective.
• Operationally efficient.
• Google has your back.
• Cons:
• APIs/SDKs change frequently.
• Some services are still in beta.
• Docs are scattered across many places.
Workflow Monitoring
• Digdag (compared with Airflow/Oozie/Luigi)
• Native support for Python & Ruby
• Multi-cloud
• Modular
• Workflow as code
• Docker support
• Alerting to Slack
Digdag Sample
Digdag
Cost Comparison
• $2,000 per month on AWS
• About $200 on GCP for production
• About another $200 for dev
• 50M events per month
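The figures above can be normalized to a cost per million events. The sketch below is only arithmetic on the slide's numbers; it assumes the GCP amounts are monthly like the AWS one, and the names `cost_per_million_events` and `savings_ratio` are made up for illustration.

```python
# Figures from the slide. Treating the GCP numbers as monthly is an assumption.
EVENTS_PER_MONTH = 50_000_000
MONTHLY_COST_USD = {
    "aws": 2000,
    "gcp_production": 200,
    "gcp_production_plus_dev": 200 + 200,
}


def cost_per_million_events(monthly_cost_usd, events_per_month=EVENTS_PER_MONTH):
    """Return the monthly cost divided by the number of millions of events."""
    return monthly_cost_usd / (events_per_month / 1_000_000)


def savings_ratio(baseline_usd, candidate_usd):
    """Return how many times cheaper the candidate is than the baseline."""
    return baseline_usd / candidate_usd


if __name__ == "__main__":
    for name, cost in MONTHLY_COST_USD.items():
        print(f"{name}: ${cost_per_million_events(cost):.2f} per million events")
    ratio = savings_ratio(
        MONTHLY_COST_USD["aws"], MONTHLY_COST_USD["gcp_production_plus_dev"]
    )
    print(f"AWS vs GCP (production + dev): {ratio:.1f}x")
```

Under these assumptions AWS comes to $40 per million events, against $4 for GCP production alone and $8 with the dev environment included.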
Business Use Case
• Digital Ads Targeting
• User Behavior Tagging
• BI
• GEO Reporting
• KPI Reporting
• User Demographic
Some Tips
• BigQuery
• https://status.cloud.google.com/incident/bigquery/18022
• Solved by Fluentd’s retry and HA
• Dataflow’s SDK and docs are not in sync
• Dataflow side inputs have a bug in streaming mode
• Compute Engine SLB - TCP/UDP setup for forwarding
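The BigQuery tip relies on the sender retrying failed writes until the backend recovers. As a rough illustration of that idea only (this is not Fluentd's implementation, and `send_with_retry` and its parameters are hypothetical), a retry loop with exponential backoff could look like this:

```python
import time


def send_with_retry(send, batch, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call send(batch), retrying on failure with exponential backoff.

    The wait before retry number k (0-based) is base_delay * 2**k seconds.
    After max_retries failed retries the last exception is re-raised, so the
    caller can fall back to another destination or keep the batch buffered.
    """
    for attempt in range(max_retries + 1):
        try:
            return send(batch)
        except Exception:
            if attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)
```

The `sleep` argument is injectable so the loop can be exercised without real waiting. A production sender would also persist the batch while retrying, which this sketch leaves out.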
Fluentd Update
• Release notes for v0.14
• Sub-second event flush
• New plugin APIs support dynamic formatting of configuration values
(e.g., path /my/dest/${tag}/mydata.%Y-%m-%d.log)
• Secure Forward
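To make the placeholder example concrete, here is a small Python sketch of how a path such as /my/dest/${tag}/mydata.%Y-%m-%d.log might resolve for one event. It mimics the idea only and is not Fluentd code; `expand_path`, the tag value, and the date are assumptions chosen for illustration.

```python
from datetime import datetime


def expand_path(template, tag, event_time):
    """Fill strftime-style time placeholders, then the ${tag} placeholder."""
    # Time placeholders go first so a "%" inside the tag is never
    # misread as a format directive.
    expanded = event_time.strftime(template)
    return expanded.replace("${tag}", tag)


if __name__ == "__main__":
    # Hypothetical tag and event time.
    print(
        expand_path(
            "/my/dest/${tag}/mydata.%Y-%m-%d.log",
            "nginx.access",
            datetime(2016, 8, 1),
        )
    )
```

Events with different tags or dates therefore land in different files without any change to the configuration.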
Demo
• Nginx -> Fluentd -> BigQuery -> Data Studio
• MySQL -> Fluentd -> BigQuery
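In the first pipeline, each Nginx access log line has to become a structured record before it can be loaded into BigQuery. The demo's actual Fluentd configuration and table schema are not shown here, so the sketch below only illustrates that transformation step in Python, assuming Nginx's default "combined" log format; the field names and the sample line are hypothetical.

```python
import re
from datetime import datetime

# Pattern for Nginx's "combined" access log format.
COMBINED = re.compile(
    r'(?P<remote>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<code>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_access_line(line):
    """Return a dict row for one access log line, or None if it does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    row = match.groupdict()
    row["code"] = int(row["code"])
    row["size"] = 0 if row["size"] == "-" else int(row["size"])
    # Convert "01/Aug/2016:12:34:56 +0800" into an ISO 8601 timestamp string.
    parsed = datetime.strptime(row["time"], "%d/%b/%Y:%H:%M:%S %z")
    row["time"] = parsed.isoformat()
    return row


if __name__ == "__main__":
    sample = (
        '203.0.113.7 - - [01/Aug/2016:12:34:56 +0800] '
        '"GET /ads?id=42 HTTP/1.1" 200 512 "-" "curl/7.47.0"'
    )
    print(parse_access_line(sample))
```

Typed fields such as the integer status code and the ISO timestamp are what make the rows queryable for reporting once they reach the warehouse.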