Patricio Rocca - August 2016 @patriciorocca - Real-time and scalable architectures

High availability, real-time and scalable architectures


Page 1: High availability, real-time and scalable architectures

Patricio Rocca - August 2016 @patriciorocca

Real-time and scalable architectures

Page 2

About Jampp

We are a tech company that helps companies grow their mobile business by driving engaged users to their apps. We are a team of 75 people, 30% of whom are in engineering.

Located in 6 cities across the US, Latin America, Europe and Africa

Machine learning

Post-install event optimisation

Dynamic Product Ads and Segments

Data Science

Programmatic Buying

Page 3

We process 220,000 bid requests per second

We process each bid request in less than 100ms

We manage 40 TB of data every day

We do real time machine learning

Jampp Architecture Impressive Facts

And… we are just a team of 22 nerds :) or :(

Page 4

Real-Time Bidding Workflow

[Diagram: Publisher and Exchange on one side, the Jampp Bidder on the other, fed by Jampp Machine Learning and the Jampp Engagement Segments Builder; labeled events: Bid, Auction Win, Placeholder Impression, Impression]

Page 5

Real-Time Tracking Workflow

[Diagram: Publisher and Client send impressions, clicks and in-app events to the Jampp NodeJS Tracking Platform]

Page 6

Let’s talk about architecture!

Page 7

Real Time Bidding Architecture

Page 8

Bid Price = CPI * eCTR * eCVR * (1-margin) * 1000

Python + Tornado + Cython + nginx (+ antigravity)

Caching, layers upon layers upon layers

Leaky bucket-ish feedback loop for pacing

With predictive local projections to account for imperfect and laggy inter-server communication

Selective, aggregate logging

Circa 25 TB of data generated per day makes naïve logging… unwise

Online Machine Learning (stochastic gradient descent model)

Real Time Bidding Architecture (details)
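The bid price formula above is simple enough to sketch directly; the function name and the numbers in the usage line are made up for illustration:

```python
# Sketch of the bid price formula from the slide:
#   Bid Price = CPI * eCTR * eCVR * (1 - margin) * 1000
# CPI is the advertiser's cost per install; eCTR and eCVR are the model's
# estimated click-through and conversion rates; the factor of 1000
# expresses the expected value of one impression as a CPM bid.

def bid_price(cpi: float, ectr: float, ecvr: float, margin: float) -> float:
    """Expected value of a single impression, as a CPM bid."""
    return cpi * ectr * ecvr * (1.0 - margin) * 1000.0

# e.g. a $2 CPI campaign with a 1% eCTR, 20% eCVR and a 30% margin:
print(bid_price(2.0, 0.01, 0.20, 0.30))  # -> 2.8
```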

Page 9

In-process L1 serves all requests

µs-latency access is a lifesaver for real-time, latency-constrained workloads

Local L2 in each server

Buffers responses from the L3

Saves bandwidth to/from the L3 (3 MB/s × 230 servers × 8 procs = death)

Decreases promotion latency to L1

Remote L3 provides main distributed cache storage

Remote S3 stores precomputed, slow-changing bundles

Speeds up load of massive near-static data

Caching
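A minimal sketch of the L1/L2/L3 lookup described above, with plain dicts standing in for the per-server L2 and a callable standing in for the remote L3 (the real backends, and the S3 bundle loading, are not shown):

```python
class LayeredCache:
    """Toy version of the layered lookup: L1 is an in-process dict with
    µs access, L2 stands in for the per-server buffer in front of the
    L3, and l3_fetch stands in for the remote distributed cache (a
    network round trip). Hits are promoted into the faster layers, so
    only the first miss per key pays the remote latency."""

    def __init__(self, l3_fetch):
        self.l1 = {}              # in-process, serves almost all requests
        self.l2 = {}              # per-server, buffers L3 responses
        self.l3_fetch = l3_fetch  # remote lookup (expensive)

    def get(self, key):
        if key in self.l1:
            return self.l1[key]
        if key in self.l2:
            value = self.l2[key]
        else:
            value = self.l3_fetch(key)  # only the miss path pays the RTT
            self.l2[key] = value        # saves L3 bandwidth next time
        self.l1[key] = value            # promote for µs access
        return value
```

Repeated `get` calls for the same key never touch the L3 again, which is the bandwidth saving the slide quantifies.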

Page 10

Uses logistic regression to predict P(click | impression) or P(install | click) using context features

Online solution that incrementally learns from the Real Time Bidding events just in time

Uses regularization and the hashing trick to explore a huge feature space while keeping only the most statistically informative features

Machine Learning
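The online learner described above can be sketched as hashed logistic regression updated one event at a time. Everything here is illustrative, not Jampp's actual setup: the feature names, hash function, dimensions, learning rate and the simple L2 penalty standing in for the regularization the slide mentions.

```python
import math
import zlib

DIM = 2 ** 20   # hashed feature space ("hashing trick"); size is illustrative
LR = 0.05       # SGD learning rate (illustrative)
L2 = 1e-6       # regularization strength (illustrative)
w = [0.0] * DIM

def hash_features(context):
    """Map raw 'field=value' strings to indices in the weight vector."""
    return [zlib.crc32(f"{k}={v}".encode()) % DIM for k, v in context.items()]

def predict(idx):
    """P(click | impression) under the current weights (logistic model)."""
    z = sum(w[i] for i in idx)
    return 1.0 / (1.0 + math.exp(-max(min(z, 35.0), -35.0)))

def update(context, clicked):
    """One stochastic gradient descent step on a single event, just in time."""
    idx = hash_features(context)
    p = predict(idx)
    g = p - (1.0 if clicked else 0.0)  # gradient of the log loss
    for i in idx:
        w[i] -= LR * (g + L2 * w[i])
    return p
```

The same shape works for P(install | click); only the label and the context features change.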

Page 11

Stream Processing Architecture

Page 12

Stream Processing Architecture (details)

Uses Amazon Kinesis for durable streaming data and AWS Lambda for data processing

DynamoDB as a temporary data store for enrichment and analytics

S3 provides a Single Source of Truth for batch data applications

Decouples data from processing to enable multiple Big Data engines running on different clusters/infrastructure

Easy on-demand scaling by AWS™

Page 13

Data Push

Pick your partition key to distribute data evenly across shards

Encoding protocol matters! MessagePack offered the best trade-off between compression ratio and serialization speed
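A sketch of the partition-key advice above. The record shape matches Kinesis `PutRecords` entries; the `device_id` field name is made up, and stdlib `json` stands in for MessagePack (the deck's actual choice) to keep the example dependency-free:

```python
import hashlib
import json  # stand-in for msgpack, so this runs with the stdlib only

def partition_key(event):
    """Hash a high-cardinality event attribute so records spread evenly.

    Keying on something low-cardinality (a country code, say) would
    hot-spot a few shards; hashing a device id gives a uniform spread.
    """
    return hashlib.md5(event["device_id"].encode()).hexdigest()

def to_kinesis_records(events):
    """Build a PutRecords-style batch: one encoded blob + key per event."""
    return [
        {"Data": json.dumps(e).encode(), "PartitionKey": partition_key(e)}
        for e in events
    ]
```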

Page 14

Data Processing and Enrichment

Write/Read Batching to reduce the HTTPS protocol overhead and costs

Exponential backoff + jitter to reduce the impact of in-app event bursts sent by the tracking platforms

Increased Data Retention Period from 1 day (default) to 3 days on the raw data streams
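The backoff-plus-jitter point above can be sketched with the "full jitter" variant commonly recommended for AWS APIs; the base and cap values here are illustrative, not the deck's:

```python
import random

def backoff_delay(attempt, base=0.1, cap=20.0):
    """Delay (seconds) before retry number `attempt`, with full jitter.

    The upper bound grows exponentially with the attempt number, but the
    actual delay is drawn uniformly from [0, bound], so a burst of
    throttled writers does not retry in lockstep and re-create the spike.
    """
    bound = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, bound)
```

A caller would sleep for `backoff_delay(n)` after the n-th throttled `PutRecords` (or DynamoDB write) and then retry.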

Page 15

Spark + Hadoop + PrestoDB = <3

Firehose provides real-time data ingestion to S3 and auto-scaling capabilities

EMR Cluster simplifies our data processing

Spark ETLs are orchestrated by Airflow to enrich and de-normalize data and convert JSON to Parquet

Spark Streaming for real-time anomaly detection and fraud prevention

Page 16

Don't fuck with real time! (caching and Cython to the rescue)

Rent first, build later

Development and staging for Big Data projects should involve production traffic, or be prepared for trouble

PrestoDB is really amazing in regards to performance, maturity and feature set

Kinesis, DynamoDB and Firehose use HTTPS as their transport protocol, which is slow and requires aggressive batching and exponential backoff + jitter

Monitoring, logs and alerts managed by AWS CloudWatch greatly simplifies production support

Lessons Learned

Page 17

Thank you ;-)

geeks.jampp.com