Building a real-time, scalable and intelligent programmatic ad buying platform

Martín Bonamico

Juan Martín Pampliega

Agenda

1. Jampp

2. Adtech, RTB, clicks, installs, events

3. Initial Architecture

4. Initial Architecture Characteristics

5. Evolution of Data Needs

6. New Data Infrastructure - Stream Processing

7. Key Takeaways

Jampp and AdTech

Jampp is a leading mobile app marketing and retargeting platform.

Founded in 2013, Jampp has offices in San Francisco, London, Berlin, Buenos Aires, São Paulo and Cape Town.

We help companies grow their business by seamlessly acquiring, engaging & retaining mobile app users.

Jampp’s platform combines machine learning with big data for programmatic ad buying which optimizes towards in-app activity.

Our platform processes more than 200,000 RTB ad bid requests per second (over 17 billion per day), which amounts to about 300 MB/s or 25 TB of data per day.
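The per-second and per-day figures above can be cross-checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes a constant rate over the whole day and decimal units (1 TB = 1,000,000 MB), and the average request size it derives is an inference, not a figure from the talk.

```python
# Back-of-the-envelope check of the throughput figures quoted above.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

bid_requests_per_second = 200_000  # from the slide
megabytes_per_second = 300         # from the slide

requests_per_day = bid_requests_per_second * SECONDS_PER_DAY
# Decimal units are assumed here: 1 TB = 1,000,000 MB.
terabytes_per_day = megabytes_per_second * SECONDS_PER_DAY / 1_000_000
# Derived value (an inference, not stated in the talk).
bytes_per_request = megabytes_per_second * 1_000_000 / bid_requests_per_second

print(f"{requests_per_day / 1e9:.2f} billion requests per day")
print(f"{terabytes_per_day:.2f} TB per day")
print(f"{bytes_per_request:.0f} bytes per request on average")
```

Under these assumptions the numbers agree with the slide: roughly 17 billion requests and about 25 TB per day.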

How do programmatic ads work?

[Diagram: the ad flow between the Source/Exchange, the Jampp Tracking Platform and the App Store / Google Play, showing the app download, the app install and the postbacks.]

RTB: Real Time Bidding

Jampp Events

1. RTB:

a. Auction: the exchange asks if we want to bid for the impression.

b. Bid/Non-Bid: bid with a price or non-bid (in less than 80 ms).

c. Impression: the ad is displayed to the user.

2. Non-RTB:

a. Click: event that marks when the user clicks on the ad.

b. Install: install of the app, registered on first app open.

c. Event: in-app events like purchase, view, favorited.
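The event taxonomy above could be modeled roughly as follows. This is a minimal sketch for illustration only: the class names, field names and the Unix-seconds timestamp are assumptions and do not come from Jampp's code.

```python
from dataclasses import dataclass
from enum import Enum


class EventType(Enum):
    # RTB events
    AUCTION = "auction"
    BID = "bid"
    NON_BID = "non_bid"
    IMPRESSION = "impression"
    # Non-RTB events
    CLICK = "click"
    INSTALL = "install"
    EVENT = "event"  # in-app event such as purchase, view, favorited


RTB_TYPES = {
    EventType.AUCTION,
    EventType.BID,
    EventType.NON_BID,
    EventType.IMPRESSION,
}


@dataclass
class TrackedEvent:
    type: EventType
    timestamp: float  # Unix seconds (assumed representation)
    payload: dict     # hypothetical free-form event details

    @property
    def is_rtb(self) -> bool:
        """True for auction, bid/non-bid and impression events."""
        return self.type in RTB_TYPES
```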

Data @ Jampp

● Our platform started out using RDBMSs and a traditional Data Warehouse architecture on Amazon Web Services.

● Data grew exponentially and data needs became more complex.

● In the last year alone, in-app events grew by more than 2500% and RTB bids by more than 500%.

● This made us evolve our architecture to be able to handle Big Data effectively.

Initial Data Architecture

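[Diagram: Initial Jampp Infrastructure. Click, Install, Event and ClickRedirect traffic goes through a Load Balancer to the Cupper instances (C1, C2, ..., Cn), which use MySQL. Auctions, Bids and Impressions are handled by the bidders (B1, B2, ..., Bn). The remaining components are the Replicator, PostgreSQL and the API (Pivots).]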

Jampp Initial Systems: Bidder

● OpenRTB bidding system implementation that runs on 200+ virtual machines with 70 GB of RAM each.

● Strong latency requirements: less than 80 ms to answer a request.

● Written in Cython; uses ZMQ for communication.

● Heavy use of coherent caching to comply with latency requirements.

● Data is continually replicated and enriched from MySQL by the replicator process.
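The interplay between the 80 ms budget and the in-memory cache can be sketched as below. This is an illustration under stated assumptions, not the actual bidder (which is written in Cython): the function name, the `app_id` lookup key, the cache layout and the pricing callback are all hypothetical.

```python
import time
from typing import Callable, Optional

BID_DEADLINE_SECONDS = 0.080  # the 80 ms budget from the slide


def decide_bid(
    request: dict,
    cache: dict,
    price_fn: Callable[[dict, dict], float],
    now: Callable[[], float] = time.monotonic,
) -> Optional[float]:
    """Return a bid price, or None (non-bid) if data is missing or time runs out."""
    deadline = now() + BID_DEADLINE_SECONDS
    # Only in-memory lookups on the hot path: no database round trip.
    # The cache is assumed to be kept fresh by a separate replicator process.
    campaign = cache.get(request.get("app_id"))
    if campaign is None:
        return None  # nothing to bid with
    price = price_fn(request, campaign)
    if now() > deadline:
        return None  # too late to answer; treat as a non-bid
    return price
```

The clock is injectable so the deadline logic can be exercised without real waiting.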

Jampp Initial Systems: Cupper

● Event tracking system written in Node.js.

● Tracks clicks, installs and in-app events (200+ million per day).

● Can be scaled horizontally (10 instances) and sits behind a load balancer (ELB).

● Uses a MySQL database to store attributed events and Kinesis to store organics.
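The split between attributed and organic events can be illustrated with a simplified routing function. This is a sketch under loud assumptions: real attribution involves attribution windows and richer matching, and the `device_id`, `click_id` and `timestamp` fields are hypothetical. Cupper itself is written in Node.js; Python is used here only for illustration.

```python
def route_install(install: dict, clicks_by_device: dict) -> tuple:
    """Classify an install as 'attributed' (a prior click exists) or 'organic'.

    clicks_by_device maps a device id to its most recent click (assumed shape).
    """
    click = clicks_by_device.get(install["device_id"])
    if click is not None and click["timestamp"] <= install["timestamp"]:
        # In the architecture described above, this would go to MySQL.
        return ("attributed", {**install, "click_id": click["click_id"]})
    # No matching click: an organic install, which would go to Kinesis.
    return ("organic", install)
```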

Jampp Initial Systems: API

● PostgreSQL is used as a Data Warehouse database, in addition to the bidder's use of it.

● An API with a caching layer exposes the data for querying.

● Fact tables are maintained with hourly, daily and monthly granularity, and high-cardinality dimensions are removed from large fact tables for data older than 15 days.

● Data is continually aggregated by an aggregation process written in Python.
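The roll-up from hourly to daily fact rows, dropping high-cardinality dimensions along the way, can be sketched as follows. This is not the actual aggregation process: the row layout (`hour`, `count`) and the dimension names in the usage example are assumptions.

```python
from collections import defaultdict
from datetime import datetime


def roll_up_to_daily(hourly_rows, keep_dimensions):
    """Sum hourly counts per day, keeping only the listed dimensions.

    Any dimension not in keep_dimensions (for example a high-cardinality
    one) is dropped, so its rows collapse into a single daily total.
    """
    totals = defaultdict(int)
    for row in hourly_rows:
        day = row["hour"].date()
        key = (day,) + tuple(row[d] for d in keep_dimensions)
        totals[key] += row["count"]
    return dict(totals)


rows = [
    {"hour": datetime(2015, 9, 1, 10), "campaign": "c1", "publisher_app": "p1", "count": 3},
    {"hour": datetime(2015, 9, 1, 11), "campaign": "c1", "publisher_app": "p2", "count": 4},
    {"hour": datetime(2015, 9, 2, 9), "campaign": "c1", "publisher_app": "p1", "count": 5},
]
daily = roll_up_to_daily(rows, keep_dimensions=["campaign"])
```

Here `publisher_app` plays the role of the high-cardinality dimension that is removed from the older, coarser table.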

Evolution of the Data Architecture

Emerging Needs

● Log forensics capabilities, as our systems and company scale and we integrate with more outside systems.

● More historical and granular data for advanced analytics and model training.

● The need arose to make the data readily available to systems outside the traditional RDBMSs. Some of these systems are too demanding for an RDBMS to handle easily.

[Diagram: Initial Evolution. Components shown: Click, Install, Event and ClickRedirect traffic; Load Balancer; Cupper instances (C1, C2, ..., Cn); MySQL (Ruby); ELB Logs; EMR - Hadoop Cluster; AirPal.]

New System Characteristics

● The new system was based on Amazon Elastic MapReduce (EMR).

● Data is imported hourly from the RDBMSs with Sqoop.

● Logs are imported every 10 minutes from different sources into S3 tables.

● Facebook PrestoDB and Apache Spark are used for interactive log analysis and analytics.

New System Characteristics

● Scalable storage and processing capabilities using HDFS, YARN and Hive for ETLs and data storage.

● Connectors for different languages like Python, Julia and Java/Scala.

● Data archiving in S3 for long-term storage and to enable other data processing technologies.

Aspects that needed improvement

● Data was still imported in batch mode. The delay for MySQL data was larger than with the Python replicator.

● EMR is not great for long-running clusters.

● The EMR cluster is not designed with strong multi-user capabilities. It is better to have multiple clusters with few users than a big one with many.

● Data was still being accumulated in the RDBMSs (clicks, installs, events).

Final stage of the evolution

● Real-time event processing architecture based on best practices for stream processing in AWS.

● Uses Amazon Kinesis for streaming data storage and AWS Lambda for data processing.

● DynamoDB and Redis are used as temporary data storage for enrichment and analytics.

● S3 gives us a Source of Truth for batch data applications, and Kinesis does the same for stream processing.
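A Lambda function consuming a Kinesis stream receives batches of records whose payload arrives base64-encoded under `Records[].kinesis.data`. The handler below is a minimal sketch of that decoding step only. It assumes JSON-encoded events, which the talk does not state, and it leaves out the enrichment and the writes to DynamoDB, Redis or S3.

```python
import base64
import json


def handler(event, context):
    """Decode a batch of Kinesis records delivered to AWS Lambda."""
    decoded = []
    for record in event["Records"]:
        # Kinesis payloads are delivered base64-encoded.
        raw = base64.b64decode(record["kinesis"]["data"])
        # Assumption: each payload is a JSON document.
        decoded.append(json.loads(raw))
    # Enrichment and storage would happen here (omitted in this sketch).
    return {"processed": len(decoded), "events": decoded}
```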

Our Real-Time Architecture

Still, it isn’t perfect...

● There is no easy way to manage windows and out-of-order data with AWS Lambda.

● Consistency guarantees of DynamoDB and S3.

● The price of AWS managed services at high event volumes, compared to self-maintained solutions.

● The ACID guarantees of RDBMSs are not an easy thing to part with.

● SQL and indexes in RDBMSs make forensics easier.
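To make the windowing problem concrete, the sketch below counts events in event-time tumbling windows and tolerates a bounded amount of lateness using a simple watermark. It is a generic illustration of the bookkeeping that a stateless function has to do by hand, not something from Jampp's system; the class name and parameters are hypothetical, and the state would need external storage to survive between Lambda invocations.

```python
from collections import defaultdict


class TumblingWindowCounter:
    """Count events per event-time window, allowing bounded lateness."""

    def __init__(self, window_seconds, allowed_lateness_seconds):
        self.window = window_seconds
        self.lateness = allowed_lateness_seconds
        self.open_windows = defaultdict(int)  # window start -> count
        self.max_event_time = float("-inf")
        self.dropped = 0  # events that arrived after their window closed

    def add(self, event_time):
        """Add one event; return the (window_start, count) pairs now final."""
        self.max_event_time = max(self.max_event_time, event_time)
        # Watermark: we assume no event older than this will still arrive.
        watermark = self.max_event_time - self.lateness
        start = event_time - (event_time % self.window)
        if start + self.window <= watermark:
            self.dropped += 1  # its window was already finalized
        else:
            self.open_windows[start] += 1
        closed = sorted(
            s for s in self.open_windows if s + self.window <= watermark
        )
        return [(s, self.open_windows.pop(s)) for s in closed]
```

With 60-second windows and 30 seconds of allowed lateness, an event stamped 50 that arrives after one stamped 70 is still counted in the first window, while an event stamped 20 that arrives after one stamped 130 is dropped.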

Benefits of the Evolution

● Enables the use of stream processing frameworks to keep data as fresh as economically possible.

● Decouples data from processing to enable multiple Big Data engines running on different clusters/infrastructure.

● Easy on-demand scaling provided by AWS managed tools like AWS Lambda, Amazon DynamoDB and Amazon EMR.

● Monitoring, logs and alerts managed by Amazon CloudWatch.

Big Data Technologies at Jampp

S3

HDFS

Hadoop/YARN

Lambda

DynamoDB

Key Takeaways

● Ad tech is a technologically intensive market that exhibits the three Vs of Big Data (volume, velocity and variety).

● As the business's data needs grow in complexity, specialized data systems need to be put in place.

● Using technologies that are meant to scale easily and are managed by a third party can bring you peace of mind.

● Stream processing is fundamental in new Big Data projects.

● There is currently no single tool that clearly fulfills all the needs for scalable and correct stream processing.

References

http://radar.oreilly.com/2015/08/the-world-beyond-batch-streaming-101.html

http://radar.oreilly.com/2015/08/the-world-beyond-batch-streaming-102.html

https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

http://blog.confluent.io/2015/01/29/making-sense-of-stream-processing/

JAMPP - AGRANDA 2015

http://44jaiio.sadio.org.ar/sites/default/files/agranda14-30.pdf

Questions?

geeks.jampp.com

We Are Hiring! - jampp.com/jobs.php

