Patricio Rocca - August 2016 - @patriciorocca
Real-time and scalable architectures
About Jampp
We are a tech company that helps companies grow their mobile business by driving engaged users to their apps. We are a team of 75 people, 30% of them in the engineering team.
Located in 6 cities across the US, Latin America, Europe and Africa
Machine learning
Post-install event optimisation
Dynamic Product Ads and Segments
Data Science
Programmatic Buying
We process 220,000 bid requests per second
We process each bid request in less than 100ms
We manage 40 TB of data every day
We do real time machine learning
Jampp Architecture Impressive Facts
And… we are just a team of 22 nerds :) or :(
Real-Time Bidding Workflow
[Diagram: Publisher → Exchange → Jampp Bidder, backed by Jampp Machine Learning and the Jampp Engagement Segments Builder; flow: bid → auction win → impression, with a click-tracking placeholder in the impression]
Real-Time Tracking Workflow
[Diagram: Publisher → Client → Jampp NodeJS tracking → Tracking Platform; flow: impression → click → in-app event]
Let’s talk about architecture!
Real Time Bidding Architecture
Bid Price = CPI * eCTR * eCVR * (1-margin) * 1000
Python + Tornado + Cython + nginx (+ antigravity)
Caching, layers upon layers upon layers
Leaky bucket-ish feedback loop for pacing
With predictive local projections to account for imperfect and laggy inter-server communication
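A minimal sketch of the leaky-bucket-style pacing idea (the class name and all rate/burst numbers are hypothetical, not Jampp's actual values):

```python
import time

class LeakyBucketPacer:
    """Spend budget at a steady rate: the bucket fills as we bid and
    drains continuously over time; a bid is only placed while the
    bucket has room. A sketch of the feedback-loop idea only."""

    def __init__(self, spend_per_sec, burst):
        self.rate = spend_per_sec      # budget drained per second
        self.capacity = burst          # max spend allowed in a burst
        self.level = 0.0               # spend currently "in the bucket"
        self.last = time.monotonic()

    def try_spend(self, amount):
        now = time.monotonic()
        # Drain the bucket for the time elapsed since the last bid
        self.level = max(0.0, self.level - (now - self.last) * self.rate)
        self.last = now
        if self.level + amount <= self.capacity:
            self.level += amount
            return True                # within pace: place the bid
        return False                   # over pace: skip this auction

# Hypothetical numbers: near-zero drain rate, bursts of up to 10 units
pacer = LeakyBucketPacer(spend_per_sec=0.001, burst=10.0)
print(pacer.try_spend(5), pacer.try_spend(4), pacer.try_spend(5))  # True True False
```

In production the drain rate itself would be what the feedback loop adjusts, with the local projections compensating for laggy inter-server budget data.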
Selective, aggregate logging
Circa 25 TB of data generated per day makes naïve logging… unwise
Online Machine Learning (stochastic gradient descent model)
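The bid-price formula above translates directly into code; the example values for CPI, eCTR, eCVR and margin below are made up:

```python
def bid_price(cpi, ectr, ecvr, margin):
    """CPM bid: cost-per-install scaled by the estimated click-through
    rate (eCTR) and conversion rate (eCVR), keeping our margin.
    The * 1000 turns a per-impression price into a per-mille (CPM) one."""
    return cpi * ectr * ecvr * (1 - margin) * 1000

# Hypothetical campaign: $2 CPI, 0.5% eCTR, 4% eCVR, 30% margin
print(round(bid_price(cpi=2.0, ectr=0.005, ecvr=0.04, margin=0.30), 4))  # 0.28 CPM
```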
Real Time Bidding Architecture (details)
In-process L1 serves all requests
µs-latency access is a lifesaver for real-time, latency-constrained workloads
Local L2 in each server
Buffers responses from the L3
Saves bandwidth to/from the L3 (3 MB/s × 230 servers × 8 procs = death)
Decreases promotion latency to L1
Remote L3 provides main distributed cache storage
Remote S3 stores precomputed, slow-changing bundles
Speeds up load of massive near-static data
Caching
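The L1/L2/L3 read path described above can be sketched as a get-through lookup with promotion; plain dicts stand in for the real L2/L3 cache clients, and all names here are illustrative:

```python
import time

class LayeredCache:
    """Layered lookup: L1 is an in-process dict (µs access), L2 a
    per-server cache buffering the remote L3 distributed store.
    Hits on a lower layer are promoted upward for later requests."""

    def __init__(self, l2, l3, l1_ttl=5.0):
        self.l1 = {}          # key -> (value, expiry timestamp)
        self.l2 = l2          # stand-in for a local per-server cache
        self.l3 = l3          # stand-in for the remote distributed cache
        self.l1_ttl = l1_ttl

    def get(self, key):
        hit = self.l1.get(key)
        if hit is not None and hit[1] > time.monotonic():
            return hit[0]                       # L1 hit: no I/O at all
        value = self.l2.get(key)
        if value is None:
            value = self.l3.get(key)            # slowest path: remote L3
            if value is not None:
                self.l2[key] = value            # buffer in L2, saving L3 bandwidth
        if value is not None:                   # promote to L1 with a short TTL
            self.l1[key] = (value, time.monotonic() + self.l1_ttl)
        return value

cache = LayeredCache(l2={}, l3={"campaign:42": {"budget": 1000}})
print(cache.get("campaign:42"))   # fetched from L3, promoted to L2 and L1
print(cache.get("campaign:42"))   # now served from L1
```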
Uses logistic regression to predict P(click | impression) or P(install | click) using context features
Online solution that incrementally learns from the Real Time Bidding events just in time
Uses regularization and hashing trick to explore a huge feature space and keep only the statistically most informative ones
Machine Learning
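A toy version of the approach above: the hashing trick maps raw features into a fixed index space, and plain SGD on log-loss with L2 regularization updates the model one event at a time (all names, dimensions and learning rates are made up, not Jampp's actual model):

```python
import math

def hash_features(raw, dim=2**20):
    """Hashing trick: map arbitrary 'field=value' strings to a fixed
    index space, so the model never needs a feature dictionary."""
    return [hash(f) % dim for f in raw]

class OnlineLogisticRegression:
    """Per-event SGD for P(click | impression) on context features."""

    def __init__(self, dim=2**20, lr=0.05, l2=1e-6):
        self.w = [0.0] * dim
        self.lr, self.l2 = lr, l2

    def predict(self, idxs):
        z = sum(self.w[i] for i in idxs)
        return 1.0 / (1.0 + math.exp(-max(min(z, 35.0), -35.0)))

    def update(self, idxs, y):
        p = self.predict(idxs)
        g = p - y                      # gradient of log-loss w.r.t. z
        for i in idxs:                 # L2 term shrinks uninformative weights
            self.w[i] -= self.lr * (g + self.l2 * self.w[i])
        return p

model = OnlineLogisticRegression()
feats = hash_features(["os=android", "country=AR", "hour=21"])
for _ in range(100):                   # pretend we saw 100 clicks in this context
    model.update(feats, 1)
print(model.predict(feats) > 0.5)      # True: the model learned these features
```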
Stream Processing Architecture
Stream Processing Architecture (details)
Uses Amazon Kinesis for durable streaming data and AWS Lambda for data processing
DynamoDB as temporary data storage for enrichment and analytics
S3 provides a Single Source of Truth for batch data applications
Decouples data from processing to enable multiple Big Data engines running on different clusters/infrastructure
Easy on-demand scaling by AWS™
Data Push
Pick your partition key to distribute data evenly across shards
Encoding protocol matters! MessagePack offered the best trade-off between compression ratio and serialization speed
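A sketch of the push side. Kinesis routes each record by the MD5 of its partition key, so a high-cardinality key spreads load evenly; `device_id` is a hypothetical key choice, JSON stands in for MessagePack so the sketch runs without third-party packages, and the boto3 call is shown commented out:

```python
import hashlib
import json

def partition_key(event):
    # A high-cardinality field such as the device id spreads records
    # evenly; a low-cardinality one (e.g. country) creates hot shards.
    return event["device_id"]

def encode(event):
    # The slides recommend MessagePack (msgpack.packb(event)) for its
    # compression/speed trade-off; JSON is used here only so this
    # sketch has no third-party dependencies.
    return json.dumps(event).encode()

def shard_for(key, n_shards):
    # Kinesis routes by MD5(partition key); this mimics that routing
    # to check how evenly a key choice distributes the load.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n_shards

events = [{"device_id": f"dev-{i}", "event": "click"} for i in range(1000)]
counts = [0] * 4
for e in events:
    counts[shard_for(partition_key(e), 4)] += 1
print(counts)  # roughly even across the 4 shards

# Actual push (requires boto3 and AWS credentials):
# kinesis = boto3.client("kinesis")
# kinesis.put_record(StreamName="events", Data=encode(e),
#                    PartitionKey=partition_key(e))
```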
Data Processing and Enrichment
Write/Read Batching to reduce the HTTPS protocol overhead and costs
Exponential backoff + jitter to reduce the impact of in-app event bursts sent by the tracking platforms
Increased Data Retention Period from 1 day (default) to 3 days on the raw data streams
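Batching plus retry-with-jitter, as described above, can be sketched as follows; `put_batch` is a hypothetical callable that returns the records that failed, mirroring how Kinesis `PutRecords` reports partial failures:

```python
import random
import time

def backoff_with_jitter(attempt, base=0.1, cap=5.0):
    """'Full jitter': sleep a random amount up to an exponentially
    growing cap, so retrying clients don't burst in lockstep."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

def put_with_retries(put_batch, records, max_attempts=6):
    """Write a whole batch per HTTPS call (amortizing the protocol
    overhead), retrying only the records that failed."""
    for attempt in range(max_attempts):
        records = put_batch(records)   # returns the failed records
        if not records:
            return True
        time.sleep(backoff_with_jitter(attempt))
    return False

attempts = []
def flaky_put(batch):                  # fails twice, then succeeds
    attempts.append(len(batch))
    return batch if len(attempts) < 3 else []

ok = put_with_retries(flaky_put, ["e1", "e2", "e3"])
print(ok, len(attempts))  # True 3
```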
Spark + Hadoop + PrestoDB = <3
Firehose provides real-time data ingestion to S3 and auto-scaling capabilities
EMR Cluster simplifies our data processing
Spark ETLs, orchestrated by Airflow, enrich data, de-normalize it and convert JSON to Parquet
Spark Streaming for real-time anomaly detection and fraud prevention
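The de-normalization step can be illustrated in plain Python; the real pipeline does this in Spark before writing Parquet, and `flatten` is a hypothetical helper, not Jampp's code:

```python
def flatten(record, prefix=""):
    """De-normalize nested JSON into flat columns, the shape that
    Parquet's columnar layout (and PrestoDB queries) handle best."""
    out = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(flatten(value, name + "_"))   # recurse into nesting
        else:
            out[name] = value
    return out

event = {"event": "install", "device": {"os": "android", "geo": {"country": "AR"}}}
print(flatten(event))
# {'event': 'install', 'device_os': 'android', 'device_geo_country': 'AR'}
```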
Don't fuck with real time! (caching and Cython to the rescue)
Rent first, build later
Development and staging for Big Data projects should involve production traffic, or be prepared for trouble
PrestoDB is really amazing in terms of performance, maturity and feature set
Kinesis, DynamoDB and Firehose use HTTPS as transport protocol, which is slow and requires aggressive batching and exponential back-off + jitter
Monitoring, logs and alerts managed by AWS CloudWatch greatly simplify production support
Lessons Learned
Thank you ;-)
geeks.jampp.com