Big Data in the advertising industry (by Michael Dewhirst) - Big Data Tech Hangout - 2013.10.26

Big Data in the advertising industry

Michael Dewhirst

Captify CTO; StrikeAd, DevZeroG co-founder

freediver, rock climber, photographer

Who am I?

Born in Moscow, Russia

UK from 1991

Working in Kiev (from London) since 1999

In IT/Software (professionally) since 1994

Ex Java, HTML/JS, ABAP/SAP, .NET (shhh..), Notes, etc developer

Working with Big Data since 2010

Freediving and rock climbing when not working

Companies

StrikeAd (2010-2013: CTO, Co-founder)

Mobile advertising media DSP / trading platform

Processing tens of billions of requests/month

Several “Big Data” solutions in place

Launched in 2010 (co-founded)

Captify (2013-now: CTO)

Search re-targeting company

Processing tens of billions of requests/month

Complex “dual” traffic and data workflow

Launched R&D department 2 months ago

Why is Big Data so key?

Pretty much everything in a business revolves around data and understanding it, and there is exponentially more data to understand every day

What is Big Data

What is big data and what solutions can be classed as such?

What is Big Data

“Internet scale” / Billions of transactions a month

2000-5000+ QPS (queries per second)
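
As a rough check on how "billions of transactions a month" relates to the QPS range above, the arithmetic can be sketched as below. The monthly volumes are illustrative assumptions, not figures from the talk, and the result is an average: peak QPS will be higher because traffic is not evenly spread.

```python
# Average queries per second implied by a monthly request volume.
# The example volumes below are assumptions chosen for illustration.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 2,592,000 seconds in a 30-day month


def average_qps(requests_per_month: float) -> float:
    """Return the average number of requests per second over a 30-day month."""
    return requests_per_month / SECONDS_PER_MONTH


if __name__ == "__main__":
    for billions in (5, 10, 13):
        qps = average_qps(billions * 1e9)
        print(f"{billions} BN requests/month is about {qps:,.0f} QPS on average")
```

Under these assumptions, 5 to 13 billion requests a month works out to roughly 2000 to 5000 QPS on average.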

What is Big Data

Processing time of under a second per transaction

Usually sub-100ms

What is Big Data

Ability to aggregate, report and analyse processed data

in near real time or real-time

What data?

Ad slots

Impressions

Clicks

Actions/conversions

Tracking pixels

Data feeds / databases

User ID

IP address

GPS lat long

Site category

Site URL

Age

Gender

Income

Connection type (mobile / wifi)

etc

The Challenge

The Challenge

A lot of volume

which needs retrospective access quickly

(s)

Architecture, Design, Solutions

Typical architecture

Modules/components:

1. Load Balancing

2. Actual processing

distributed identical workers

3. Logging

4. ETL (Extract Transform Load)

Processing logs, summarising/aggregating by keys

5. Aggregated data

6. “Big DataBase” (sometimes x2)

7. Machine learning
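
The ETL step (4) above, "processing logs, summarising/aggregating by keys", might look something like this minimal sketch. The JSON log format and the field names `campaign`, `hour` and `type` are hypothetical, picked only to make the example concrete.

```python
import json
from collections import Counter

# Hypothetical log lines: one JSON event per line, as a worker might write them.
SAMPLE_LOG = [
    '{"campaign": "c1", "hour": "2013-10-26T12", "type": "impression"}',
    '{"campaign": "c1", "hour": "2013-10-26T12", "type": "impression"}',
    '{"campaign": "c1", "hour": "2013-10-26T12", "type": "click"}',
    '{"campaign": "c2", "hour": "2013-10-26T13", "type": "impression"}',
]


def aggregate(log_lines):
    """Extract each event, reduce it to a key, and count events per key."""
    totals = Counter()
    for line in log_lines:
        event = json.loads(line)                                  # extract
        key = (event["campaign"], event["hour"], event["type"])   # transform
        totals[key] += 1                                          # aggregate
    return totals


if __name__ == "__main__":
    # The "load" step would write these rows to the aggregated data store.
    for key, count in sorted(aggregate(SAMPLE_LOG).items()):
        print(key, count)
```

The aggregated rows are far smaller than the raw logs, which is what makes near real-time reporting on them practical.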

Big Data specific features

Load balancing

By geo - routing requests to nearest data centre

By load - usually round robin evenly distributing traffic between available nodes

DNS or software based (or both)
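
A software round robin balancer of the kind mentioned above can be sketched in a few lines. The class name and the worker names are assumptions for illustration; a real deployment would also need health checks and node removal.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out nodes in a fixed rotation so traffic is spread evenly."""

    def __init__(self, nodes):
        nodes = list(nodes)
        if not nodes:
            raise ValueError("at least one node is required")
        self._rotation = cycle(nodes)

    def next_node(self):
        """Return the node that should receive the next request."""
        return next(self._rotation)


# Hypothetical worker names, for illustration only.
balancer = RoundRobinBalancer(["worker-1", "worker-2", "worker-3"])
assignments = [balancer.next_node() for _ in range(6)]

if __name__ == "__main__":
    print(assignments)
```

Each of the three workers receives every third request, regardless of what the request contains.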

Big Data specific features

Storage RW/RO

In-mem only for real time data (sub 100ms access)

On disk for near-line, non-“realtime” access

Big Data specific features

Storage - in-mem (fast) - Sharding

Splitting data across several nodes (e.g. “A-C” - node1; “D-F” - node2, etc) - whole DB does not fit in one server memory

Hashing request data to determine storage node

2 tier architecture:

1) Load balancing tier evenly distributing traffic between available nodes - each LB is identical

2) Data storage tier, only processing relevant requests, each node only stores its chunk/shard of entire “spread out” DB

Sharding architecture
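
"Hashing request data to determine storage node" could be implemented along these lines. This is a minimal sketch: the node names, the choice of MD5 and the plain modulo mapping are assumptions, not the scheme used by any particular product.

```python
import hashlib

# Hypothetical storage nodes; each one holds only its own shard of the data.
NODES = ["node1", "node2", "node3"]


def shard_for(key: str, nodes=NODES) -> str:
    """Map a request key (e.g. a user ID) to the node that stores it."""
    # Use a stable hash: Python's built-in hash() is randomised per process.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


if __name__ == "__main__":
    for user_id in ("user-1", "user-2", "user-3", "user-42"):
        print(user_id, "->", shard_for(user_id))
```

Every load balancer can run the same function, so any of them routes a given key to the same storage node. One known limitation of plain modulo mapping is that changing the number of nodes remaps most keys; consistent hashing is a common way to reduce that.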

Dynamic scaling

Cloud based hosting charges are usually time based

Local continental data centres are needed

Traffic usually fluctuates significantly during the day, week, month and year

Cloud based hosting allows quick server/instance commissioning / decommissioning

Instances can be added as traffic trends grow and removed as they drop to save cost
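
The scaling decision itself can be as simple as the sketch below. The per-instance capacity, the 20% headroom and the minimum of two instances are all assumed values for illustration; real numbers would come from load testing and monitoring.

```python
import math


def instances_needed(current_qps, qps_per_instance, headroom_percent=20, minimum=2):
    """Return how many instances to run for the current traffic level.

    headroom_percent adds spare capacity for spikes; minimum keeps more than
    one instance running so there is no single point of failure.
    """
    required = current_qps * (100 + headroom_percent) / (100 * qps_per_instance)
    return max(minimum, math.ceil(required))


if __name__ == "__main__":
    # Hypothetical traffic levels over a day, each instance handling 500 QPS.
    for qps in (800, 2000, 5000):
        print(f"{qps} QPS -> {instances_needed(qps, 500)} instances")
```

With these assumed values the cluster grows from 2 instances at 800 QPS to 12 at 5000 QPS and shrinks again when traffic drops, which is where time-based cloud billing saves cost.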

Other areas

Automatic node updating (there can be hundreds to manage)

Monitoring and alerting (load, space, errors, etc)

Burn in - testing new code on a small cluster before upgrading whole network

Good security - firewalls, local user/file access, etc

Avoid having single points of failure

Old log near-line storage (e.g. Amazon Glacier)

Architecture, design, solutions

Any other “modules”?

Machine learning

What is machine learning?

Automated, algorithmic statistical data analysis and pattern detection

What?!

Used in advertising?

To help find repeatable actions with lowered risk and high expected outcome certainty

Meaning...

Finding links between ad properties to buy more clicks or actions, e.g.

ad shown on site a, during lunch time, ad size 320x600, user from London, etc - CPC likelihood of 10%

user with iPhone, in Central Kiev, having been to dance club sites - 30% likelihood of conversion to taxi advertising
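
The simplest version of "finding links between ad properties" is counting how often each combination of properties led to a conversion. The sketch below does only that, on made-up data; it is not a real machine learning model, which would generalise across combinations rather than count each one separately.

```python
from collections import defaultdict


def conversion_rates(events):
    """Return the observed conversion rate for each feature combination."""
    shown = defaultdict(int)
    converted = defaultdict(int)
    for features, did_convert in events:
        shown[features] += 1
        if did_convert:
            converted[features] += 1
    return {features: converted[features] / shown[features] for features in shown}


# Fabricated example events: (device, city) and whether the user converted.
EVENTS = (
    [(("iPhone", "Kiev"), True)] * 3
    + [(("iPhone", "Kiev"), False)] * 7
    + [(("Android", "London"), True)] * 1
    + [(("Android", "London"), False)] * 9
)

rates = conversion_rates(EVENTS)

if __name__ == "__main__":
    for features, rate in sorted(rates.items()):
        print(features, f"{rate:.0%}")
```

A bidder could then pay more for ad slots whose properties match the higher-rate combinations.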

Vendors and solutions

Vendors and solutions

Apache Hadoop

Nginx

Erlang, OTP, etc

Aerospike

MongoDB

Amazon Redshift

Google BigQuery

Dynamo

PostgreSQL

Memcache

Xtremedata

Vendors and solutions

DynDNS

Neustar DNS

Neustar Quova Geo DB

Amazon Route53

Amazon Load Balancing

Real world examples

• Companies who have big data at their core

Google AdX / DoubleClick

Online and mobile Advertising Exchange

Ad serving

Criteo

Conclusions

A complex, specialised industry and software development sub-category

Technically challenging by an order of magnitude

NOT only for “special” people - anybody can get in - I did

Genuinely interesting to work in

Questions?

The end

Thank you!