Big Data in the advertising industry
Michael Dewhirst
Captify CTO; StrikeAd, DevZeroG co-founder
freediver, rock climber, photographer
Who am I?
Born in Moscow, Russia
In the UK from 1991
Working in Kiev (from London) since 1999
In IT/software (professionally) since 1994
Ex Java, HTML/JS, ABAP/SAP, .NET (shhh..), Notes, etc. developer
Working with Big Data since 2010
Freediving and rock climbing when not working
Companies
StrikeAd (2010-2013: CTO, co-founder)
Mobile advertising media DSP / trading platform
Processing tens of billions of requests/month
Several “Big Data” solutions in place
Launched in 2010 (co-founded)
Captify (2013-now: CTO)
Search retargeting company
Processing tens of billions of requests/month
Complex “dual” traffic and data workflow
Launched R&D department 2 months ago
Why is Big Data so key?
Pretty much everything in a business revolves around data and understanding it, and there is exponentially more data to understand every day
What is Big Data
What is big data and what solutions can be classed as such?
“Internet scale” / Billions of transactions a month
2000-5000+ QPS (queries per second)
Processing time of under a second per transaction
Usually sub-100 ms
Ability to aggregate, report on and analyse processed data in real time or near real time
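As a rough sanity check on the volume figures above, the sketch below converts a monthly request count into an average QPS. The 10 billion figure and the 30-day month are illustrative assumptions, not numbers from a specific system.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_qps(requests_per_month: float, days_in_month: int = 30) -> float:
    """Average queries per second implied by a monthly request volume."""
    return requests_per_month / (days_in_month * SECONDS_PER_DAY)


# Hypothetical round figure: 10 billion requests in a 30-day month.
qps = average_qps(10e9)
print(round(qps))  # 3858
```

This is an average only. Traffic fluctuates during the day, so peak QPS will be higher than this figure.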
What data?
Ad slots
Impressions
Clicks
Actions/conversions
Tracking pixels
Data feeds / databases
User ID
IP address
GPS latitude/longitude
Site category
Site URL
Age
Gender
Income
Connection type (mobile / wifi)
etc
The Challenge
A lot of volume, which needs retrospective access quickly (s)
Architecture, Design, Solutions
Typical architecture
Modules/components:
1. Load Balancing
2. Actual processing: distributed identical workers
3. Logging
4. ETL (Extract, Transform, Load): processing logs, summarising/aggregating by keys
5. Aggregated data
6. “Big DataBase” (sometimes x2)
7. Machine learning
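The ETL step in the list above (processing logs and aggregating by keys) could look roughly like the sketch below. The tab-separated log format and the (site, event) key are assumptions made for illustration only.

```python
from collections import Counter


def aggregate(log_lines):
    """Count events per (site, event_type) key from raw log lines."""
    counts = Counter()
    for line in log_lines:
        # Assumed format: "<site>\t<event_type>" on each line.
        site, event = line.strip().split("\t")
        counts[(site, event)] += 1
    return counts


# Hypothetical log lines, as a worker might write them.
logs = [
    "site-a\timpression",
    "site-a\timpression",
    "site-a\tclick",
    "site-b\timpression",
]
summary = aggregate(logs)
```

The aggregated counts, rather than the raw logs, are what would then be loaded into the reporting database.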
Big Data specific features
Load balancing
By geo - routing requests to the nearest data centre
By load - usually round robin, evenly distributing traffic between available nodes
DNS-based or software-based (or both)
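A minimal software-based sketch of the two routing rules above: pick the nearest data centre, then rotate round robin across its nodes. The data centre names, coordinates and node names are made up, and the distance is a crude straight-line measure in degrees rather than a real geo lookup.

```python
import itertools
import math

# Hypothetical data centres with approximate (lat, lon) and worker nodes.
DATA_CENTRES = {
    "eu-west": {"location": (51.5, -0.1), "nodes": ["eu-1", "eu-2"]},
    "us-east": {"location": (40.7, -74.0), "nodes": ["us-1", "us-2", "us-3"]},
}

# One endless round-robin iterator per data centre.
_round_robin = {
    name: itertools.cycle(dc["nodes"]) for name, dc in DATA_CENTRES.items()
}


def nearest_data_centre(lat, lon):
    """Pick the data centre closest to the request (rough approximation)."""
    return min(
        DATA_CENTRES,
        key=lambda name: math.dist(DATA_CENTRES[name]["location"], (lat, lon)),
    )


def route(lat, lon):
    """Return (data_centre, node): geo first, then round robin by load."""
    dc = nearest_data_centre(lat, lon)
    return dc, next(_round_robin[dc])
```

Calling `route(50.45, 30.52)` repeatedly would cycle through the nodes of the nearest data centre in turn.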
Storage RW/RO
In-memory only for real-time data (sub-100 ms access)
On disk for near-line, non-“real-time” access
Storage - in-memory (fast) - Sharding
Splitting data across several nodes (e.g. “A-C” - node1; “D-F” - node2, etc.), because the whole DB does not fit in one server’s memory
Hashing request data to determine the storage node
2-tier architecture:
1) Load balancing tier, evenly distributing traffic between available nodes - each LB is identical
2) Data storage tier, only processing relevant requests - each node stores only its own chunk/shard of the entire “spread out” DB
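The hashing idea above can be sketched as follows. This is a minimal in-process illustration, assuming plain modulo hashing over a fixed number of shards; the class name and the use of SHA-256 are choices made for the example, not details from the talk.

```python
import hashlib


class ShardedStore:
    """Toy in-memory store where each shard holds only its slice of the data."""

    def __init__(self, shard_count):
        # Each dict stands in for one storage node.
        self.shards = [{} for _ in range(shard_count)]

    def shard_for(self, key: str) -> int:
        """Hash the key to a stable shard index."""
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def put(self, key, value):
        self.shards[self.shard_for(key)][key] = value

    def get(self, key, default=None):
        return self.shards[self.shard_for(key)].get(key, default)


store = ShardedStore(shard_count=4)
store.put("user-123", {"gender": "f", "city": "London"})
profile = store.get("user-123")
```

Because every load balancer computes the same hash, any of them can route a request to the right storage node. With modulo hashing, changing the shard count remaps most keys, which is why consistent hashing is a common alternative.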
Sharding architecture
Dynamic scaling
Cloud-based hosting charges are usually time-based
Data centres local to each continent are needed
Traffic usually fluctuates significantly during the day, week, month and year
Cloud-based hosting allows quick commissioning and decommissioning of servers/instances
Instances can be added as traffic grows and removed as it drops, to save cost
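The scaling decision above comes down to simple arithmetic. The sketch below assumes each instance handles a known QPS and that some spare headroom is kept; the function name, the 25% headroom and the minimum of two instances are illustrative assumptions.

```python
import math


def instances_needed(current_qps, qps_per_instance, headroom=0.25, minimum=2):
    """Instances required to serve current traffic with spare capacity.

    `minimum` keeps at least two instances running so that a single
    instance is never a single point of failure.
    """
    required = current_qps * (1 + headroom) / qps_per_instance
    return max(minimum, math.ceil(required))


# Hypothetical figures: each instance serves 500 QPS.
peak = instances_needed(4000, 500)
quiet = instances_needed(100, 500)
```

Re-running this calculation as traffic changes gives the number of instances to commission or decommission.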
Other areas
Automatic node updating (there can be hundreds to manage)
Monitoring and alerting (load, space, errors, etc.)
Burn-in - testing new code on a small cluster before upgrading the whole network
Good security - firewalls, local user/file access, etc.
Avoid having single points of failure
Near-line storage for old logs (e.g. Amazon Glacier)
Architecture, design, solutions
Any other “modules”?
Machine learning
What is machine learning?
Automated, algorithmic statistical data analysis and pattern detection
What?!
Used in advertising?
To help find repeatable actions with lowered risk and high expected outcome certainty
Meaning...
Finding links between ad properties to buy more clicks or actions, e.g.
ad shown on site A, during lunch time, ad size 320x600, user from London, etc. - CPC likelihood of 10%
user with iPhone, in central Kiev, having been to dance club sites - 30% likelihood of conversion on taxi advertising
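To make the examples above concrete, the sketch below estimates a conversion rate for each combination of ad properties by plain counting. It is a simplified stand-in for the idea, not a real machine learning model, and the event fields and sample data are invented for illustration.

```python
from collections import defaultdict


def conversion_rates(events, features):
    """Observed conversion rate for each combination of feature values."""
    shown = defaultdict(int)
    converted = defaultdict(int)
    for event in events:
        key = tuple(event[f] for f in features)
        shown[key] += 1
        if event["converted"]:
            converted[key] += 1
    return {key: converted[key] / shown[key] for key in shown}


# Hypothetical logged ad events.
events = [
    {"device": "iPhone", "city": "Kiev", "converted": True},
    {"device": "iPhone", "city": "Kiev", "converted": False},
    {"device": "Android", "city": "London", "converted": False},
]
rates = conversion_rates(events, features=("device", "city"))
```

With enough data, combinations showing a high rate are the ones worth bidding on more often.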
Vendors and solutions
Apache Hadoop
Nginx
Erlang, OTP, etc
Aerospike
MongoDB
Amazon Redshift
Google BigQuery
Dynamo
PostgreSQL
Memcached
XtremeData
DynDNS
Neustar DNS
Neustar Quova Geo DB
Amazon Route 53
Amazon Elastic Load Balancing
Real-world examples
Companies that have big data at their core
Google AdX / DoubleClick
Online and mobile Advertising Exchange
Ad serving
Criteo
Conclusions
A complex, specialised industry and software development sub-category
Technically challenging by an order of magnitude
NOT only for “special” people - anybody can get in - I did
Genuinely interesting to work in
Questions?
The end
Thank you!