Building a real-time, scalable and intelligent programmatic ad buying platform
Martín Bonamico
Juan Martín Pampliega
Agenda
1. Jampp
2. Adtech, RTB, clicks, installs, events
3. Initial Architecture
4. Initial Architecture Characteristics
5. Evolution of Data Needs
6. New Data Infrastructure - Stream Processing
7. Key Takeaways
Jampp is a leading mobile app marketing and retargeting platform.
Founded in 2013, Jampp has offices in San Francisco, London, Berlin, Buenos Aires, São Paulo and Cape Town.
We help companies grow their business by seamlessly acquiring, engaging & retaining mobile app users.
Jampp’s platform combines machine learning with big data for programmatic ad buying which optimizes towards in-app activity.
Our platform processes more than 200,000 RTB ad bid requests per second (over 17 billion per day), which amounts to about 300 MB/s or 25 TB of data per day.
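As a quick sanity check, the per-day figures follow from the per-second rates. The sketch below assumes decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes); the average request size is only what the two quoted rates imply, not a number stated in the talk.

```python
# Back-of-the-envelope check of the throughput figures quoted above.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_day(rate_per_second: float) -> float:
    """Scale a per-second rate to a per-day total."""
    return rate_per_second * SECONDS_PER_DAY


bid_requests_per_day = per_day(200_000)      # requests per day
bytes_per_day = per_day(300 * 10**6)         # 300 MB/s, decimal megabytes (assumption)
avg_request_bytes = 300 * 10**6 / 200_000    # average payload implied by the two rates

print(f"{bid_requests_per_day / 1e9:.2f} billion requests/day")
print(f"{bytes_per_day / 1e12:.2f} TB/day")
print(f"{avg_request_bytes:.0f} bytes per request")
```

The results (about 17.3 billion requests and about 25.9 TB per day) agree with the rounded figures above.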
How do programmatic ads work?
[Diagram labels: DOWNLOAD APP, Source / Exchange, Jampp Tracking Platform, App Store / Google Play, App Install, Postback]
Jampp Events
1. RTB:
a. Auction: the exchange asks if we want to bid for the impression.
b. Bid/Non-Bid: we answer with a bid price or a non-bid (in less than 80 ms).
c. Impression: the ad is displayed to the user.
2. Non-RTB:
a. Click: event that marks when the user clicks on the ad.
b. Install: install of the app, recorded on first app open.
c. Event: in-app events like purchase, view, favorited.
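The bid/non-bid step can be illustrated with a minimal sketch. It is not the Jampp bidder (which is written in Cython): `answer_auction`, `BidResponse` and `price_fn` are hypothetical names, and the deadline is only checked after pricing, whereas a production bidder would need to enforce the 80 ms budget while the work is still in progress.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional

BID_DEADLINE_SECONDS = 0.080  # the 80 ms budget mentioned above


@dataclass
class BidResponse:
    bid: bool
    price: Optional[float] = None  # only set when bid is True


def answer_auction(request: dict,
                   price_fn: Callable[[dict], Optional[float]],
                   clock: Callable[[], float] = time.monotonic) -> BidResponse:
    """Return a priced bid, or a non-bid if pricing declined or took too long."""
    start = clock()
    price = price_fn(request)  # hypothetical pricing model; None means "do not bid"
    elapsed = clock() - start
    if price is None or elapsed > BID_DEADLINE_SECONDS:
        return BidResponse(bid=False)
    return BidResponse(bid=True, price=price)


print(answer_auction({"id": "auction-1"}, lambda req: 1.25))
```

Passing the clock as a parameter keeps the deadline logic testable without sleeping.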
Data @ Jampp
● Our platform started out using RDBMSs and a traditional Data Warehouse architecture on Amazon Web Services.
● Data grew exponentially and data needs became more complex.
● In the last year alone, in-app events grew by more than 2,500% and RTB bids by more than 500%.
● This made us evolve our architecture to be able to handle Big Data effectively.
Initial Jampp Infrastructure
[Diagram labels: C1, C2 ... Cn (Cupper); Load Balancer; MySQL; Click, Install, Event, ClickRedirect; PostgreSQL; B1, B2 ... Bn; Replicator; API (Pivots); Auctions, Bids, Impressions]
Jampp Initial Systems: Bidder
● OpenRTB bidding system implementation that runs on 200+ virtual machines with 70 GB of RAM each.
● Strict latency requirements: less than 80 ms to answer a request.
● Written in Cython; uses ZMQ for communication.
● Heavy use of coherent caching to meet the latency requirements.
● Data is continually replicated from MySQL and enriched by the replicator process.
Jampp Initial Systems: Cupper
● Event tracking system written in Node.js.
● Tracks clicks, installs and in-app events (200+ million per day).
● Can be scaled horizontally (10 instances) and sits behind a load balancer (ELB).
● Uses a MySQL database to store attributed events and Kinesis to store organics.
Jampp Initial Systems: API
● PostgreSQL is used as the Data Warehouse database, in addition to its use by the bidder.
● An API with a caching layer exposes the data for querying.
● Fact tables are maintained with hourly, daily and monthly granularity. In large fact tables, high-cardinality dimensions are removed for data older than 15 days.
● Data is continually aggregated by an aggregation process written in Python.
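The retention rule for large fact tables can be sketched as follows. This is an illustration under assumptions, not the actual aggregation process: `roll_up`, the row layout and the dimension names are hypothetical, and only the 15-day threshold comes from the slide.

```python
from collections import defaultdict
from datetime import date, timedelta

RETENTION_DAYS = 15  # threshold from the slide


def roll_up(rows, high_cardinality_dims, today):
    """Drop high-cardinality dimensions from rows older than RETENTION_DAYS,
    then re-aggregate counts over the dimensions that remain."""
    cutoff = today - timedelta(days=RETENTION_DAYS)
    totals = defaultdict(int)
    for row in rows:
        dims = dict(row["dims"])
        if row["day"] < cutoff:
            for name in high_cardinality_dims:
                dims.pop(name, None)
        key = (row["day"], tuple(sorted(dims.items())))
        totals[key] += row["count"]
    return [{"day": day, "dims": dict(dims), "count": count}
            for (day, dims), count in totals.items()]


today = date(2015, 9, 1)
old_day = today - timedelta(days=30)
rows = [
    {"day": old_day, "dims": {"campaign": "c1", "device_id": "d1"}, "count": 3},
    {"day": old_day, "dims": {"campaign": "c1", "device_id": "d2"}, "count": 4},
    {"day": today, "dims": {"campaign": "c1", "device_id": "d1"}, "count": 1},
    {"day": today, "dims": {"campaign": "c1", "device_id": "d2"}, "count": 2},
]
compacted = roll_up(rows, high_cardinality_dims=["device_id"], today=today)
print(len(compacted))
```

The two old rows collapse into a single row per campaign, while recent rows keep their full granularity.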
Emerging Needs
● Log forensics capabilities, as our systems and company scale and we integrate with more outside systems.
● More historical and more granular data for advanced analytics and model training.
● Data needs to be readily available to systems other than the traditional RDBMSs. Some of these systems are too demanding for an RDBMS to handle easily.
Initial Evolution
[Diagram labels: C1, C2 ... Cn (Cupper); Load Balancer; MySQL (Ruby); Click, Install, Event, ClickRedirect; ELB Logs; C1, C2 ... Cn; EMR - Hadoop Cluster; AirPal]
New System Characteristics
● The new system was based on Amazon Elastic MapReduce (EMR).
● Data is imported hourly from the RDBMSs with Sqoop.
● Logs are imported every 10 minutes from different sources to S3 tables.
● Facebook PrestoDB and Apache Spark are used for interactive log analysis and analytics.
New System Characteristics
● Scalable storage and processing capabilities using HDFS, YARN and Hive for ETLs and data storage.
● Connectors from different languages like Python, Julia and Java/Scala.
● Data archiving in S3 for long-term storage and for enabling other data processing technologies.
Aspects that needed improvement
● Data was still imported in batch mode. The delay for MySQL data was larger than with the Python replicator.
● EMR is not great for long-running clusters.
● An EMR cluster is not designed with strong multi-user capabilities. It is better to have multiple clusters with few users each than one big cluster with many.
● Data was still being accumulated in the RDBMSs (clicks, installs, events).
Final stage of the evolution
● Real-time event processing architecture based on best practices for stream processing in AWS.
● Uses Amazon Kinesis for streaming data storage and AWS Lambda for data processing.
● DynamoDB and Redis are used for temporary data storage for enrichment and analytics.
● S3 gives us a Source of Truth for batch data applications, and Kinesis does the same for stream processing.
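A minimal sketch of the Kinesis-to-Lambda processing step is shown below. The event shape (base64-encoded payloads under `Records[].kinesis.data`) is the one AWS Lambda delivers for Kinesis triggers. The JSON payload format, the `campaign_id` field and the in-memory `CAMPAIGNS` table (standing in for a DynamoDB or Redis lookup) are assumptions made for illustration.

```python
import base64
import json

# Stand-in for an enrichment lookup in DynamoDB or Redis (hypothetical data).
CAMPAIGNS = {"c-1": {"advertiser": "example-app"}}


def decode_records(event):
    """Yield the JSON payload of each Kinesis record in a Lambda event."""
    for record in event.get("Records", []):
        payload = base64.b64decode(record["kinesis"]["data"])
        yield json.loads(payload)


def handler(event, context):
    """Lambda entry point: decode each record and enrich it with campaign data."""
    enriched = []
    for item in decode_records(event):
        extra = CAMPAIGNS.get(item.get("campaign_id"), {})
        enriched.append({**item, **extra})
    return enriched


# Local usage example with a hand-built event.
raw = json.dumps({"type": "click", "campaign_id": "c-1"}).encode("utf-8")
sample_event = {"Records": [{"kinesis": {"data": base64.b64encode(raw).decode("ascii")}}]}
print(handler(sample_event, None))
```

In a real deployment the enriched records would be written onward (for example to S3 or another stream) rather than returned.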
Still, it isn’t perfect...
● There is no easy way to manage windows and out-of-order data with AWS Lambda.
● Consistency of DynamoDB and S3.
● Price of AWS managed services for high event volumes compared to self-maintained solutions.
● The ACID guarantees of RDBMSs are not an easy thing to part with.
● SQL and indexes in RDBMSs make forensics easier.
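To make the windowing and out-of-order problem concrete, here is a small sketch of what a stream processor has to track and what a stateless function does not give you for free: event-time tumbling windows, a watermark, and a policy for late data. `TumblingWindowCounter` and its parameters are hypothetical, and timestamps are assumed to be integer seconds.

```python
from collections import defaultdict


class TumblingWindowCounter:
    """Count events per fixed event-time window, tolerating bounded lateness."""

    def __init__(self, window_seconds, allowed_lateness_seconds):
        self.window = window_seconds
        self.lateness = allowed_lateness_seconds
        self.open_windows = defaultdict(int)  # window start -> event count
        self.max_event_time = 0
        self.dropped = 0  # events that arrived after their window was closed

    def add(self, event_time):
        """Add one event; return the (window_start, count) pairs it finalizes."""
        self.max_event_time = max(self.max_event_time, event_time)
        watermark = self.max_event_time - self.lateness
        start = event_time - event_time % self.window
        if start + self.window <= watermark:
            self.dropped += 1  # too late: its window is already final
        else:
            self.open_windows[start] += 1
        closed = sorted(s for s in self.open_windows if s + self.window <= watermark)
        return [(s, self.open_windows.pop(s)) for s in closed]


counter = TumblingWindowCounter(window_seconds=60, allowed_lateness_seconds=30)
for t in [10, 70, 50, 95, 20]:  # 50 and 20 arrive out of order
    print(t, counter.add(t))
```

The event at t=50 still lands in the first window because it arrives within the allowed lateness, while the event at t=20 arrives after that window was emitted and is counted as dropped.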
Benefits of the Evolution
● Enables the use of stream processing frameworks to keep data as fresh as economically possible.
● Decouples data from processing to enable multiple Big Data engines running on different clusters/infrastructure.
● Easy on-demand scaling provided by AWS managed tools like AWS Lambda, Amazon DynamoDB and Amazon EMR.
● Monitoring, logs and alerts managed by Amazon CloudWatch.
Key Takeaways
● Ad tech is a technologically intensive market that meets the three Vs of Big Data (volume, velocity, variety).
● As the business’s data needs grow in complexity, specialized data systems need to be put in place.
● Using technologies that are meant to scale easily and are managed by a third party can bring you peace of mind.
● Stream processing is fundamental in new Big Data projects.
● There is currently no single tool that clearly fulfills all the needs for scalable and correct stream processing.
References
http://radar.oreilly.com/2015/08/the-world-beyond-batch-streaming-101.html
http://radar.oreilly.com/2015/08/the-world-beyond-batch-streaming-102.html
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
http://blog.confluent.io/2015/01/29/making-sense-of-stream-processing/
JAMPP - AGRANDA 2015
http://44jaiio.sadio.org.ar/sites/default/files/agranda14-30.pdf