Experimental Design
Sergei Vassilvitskii, Columbia University
Computational Social Science, April 5, 2013
Thursday, April 25, 13
Measurement
"Half the money I spend on advertising is wasted; the trouble is, I don't know which half."
– John Wanamaker, 1875
Helping John:
Idea 1
Measure the final effect:
– Track total store sales, compare to advertising budget
Findings:
– Total sales are typically higher after intense advertising
Problems:
– Stores advertise when people tend to spend anyway: Christmas shopping periods, travel during the summer, ski gear in winter, etc.
Correlation vs. Causation
Idea 1
This is a within-subject pre-test/post-test design.
Idea 2
"Measuring the online sales impact of an online ad or a paid-search campaign -- in which a company pays to have its link appear at the top of a page of search results -- is straightforward: We determine who has viewed the ad, then compare online purchases made by those who have and those who have not seen it."
– Magid Abraham, CEO, President & Co-Founder of comScore, in an HBR article (2008)
Idea 2
Measure the difference between people who see ads and those who don't.
Findings:
– People who see the ads are more likely to react to them
Problems:
– Ads are finely targeted: these are exactly the people who are likely to click! (Don't advertise cars in fashion magazines.)
– The effect is even more extreme online: which ads are shown depends on the user's propensity to click on them.
Idea 3
Matching:
– Compare people who saw an ad with similar people who didn't see it but are otherwise "the same."
Problems:
– Hard to define "the same." Beware of lurking variables.
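The matching idea can be sketched in a few lines. The users, covariates, and outcomes below are entirely hypothetical; the sketch only illustrates pairing each exposed user with an unexposed user who looks "the same" on observed covariates:

```python
# Toy matching sketch (hypothetical users and covariates): pair each
# ad-exposed user with an unexposed user who has the same observed profile.
exposed = [
    {"age": "18-25", "usage": "heavy", "purchased": 1},
    {"age": "26-35", "usage": "light", "purchased": 0},
]
unexposed = [
    {"age": "18-25", "usage": "heavy", "purchased": 0},
    {"age": "26-35", "usage": "light", "purchased": 0},
    {"age": "26-35", "usage": "heavy", "purchased": 1},
]

def profile(user):
    # "The same" here means: identical on the observed covariates only.
    return (user["age"], user["usage"])

pairs = []
pool = list(unexposed)
for e in exposed:
    match = next((u for u in pool if profile(u) == profile(e)), None)
    if match is not None:
        pool.remove(match)
        pairs.append((e, match))

# Estimated effect: average outcome difference within matched pairs.
effect = sum(e["purchased"] - u["purchased"] for e, u in pairs) / len(pairs)
print(effect)  # 0.5
```

Any covariate left out of `profile` is a potential lurking variable: two "matched" users can still differ on it.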
Case Study: Ad Wear-Out
What is the optimal number of times to show an ad?
Few:
– Don't want the user to be annoyed
– No need to waste money if the ad is ineffective
Many:
– Make sure the user sees it
– Reinforce the message
Observational Study
Look through the data:
– Find the users who saw the ad once
– Find the users who saw the ad many times
– Measure revenue for the two sets of users
Conclusion: limit the number of impressions.
Correlations
Why did some users see the ad only once?
– They must use the web differently
– Some sign on once a week just to check email
– Others are always online
Correct conclusion:
– People who visit the homepage often are unlikely to click on ads
– We have not measured the effect of wear-out
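The confound can be seen in a toy simulation. The user counts and click rates below are made up for illustration: every user's click probability is constant per impression, so there is no wear-out at all in this world, yet the one-impression group still shows a far higher click-through rate:

```python
import random

random.seed(0)
# Hypothetical population: light users see the ad once and click often;
# heavy users see it 20 times and almost never click.
users = ([{"imps": 1, "p_click": 0.02} for _ in range(5000)] +
         [{"imps": 20, "p_click": 0.001} for _ in range(5000)])

def group_ctr(group):
    clicks = sum(random.random() < u["p_click"]
                 for u in group for _ in range(u["imps"]))
    imps = sum(u["imps"] for u in group)
    return clicks / imps

once = [u for u in users if u["imps"] == 1]
many = [u for u in users if u["imps"] > 1]
# The "once" group has a much higher CTR even though no individual user's
# click probability changes with repeated exposure.
print(group_ctr(once), group_ctr(many))
```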
Simpson's Paradox
Kidney stones [real data].
You have kidney stones. There are two treatments, A and B.
– Empirically, treatment A is effective 78% of the time
– Empirically, treatment B is effective 83% of the time
– Which one do you choose?
Simpson's Paradox
Digging into the data, you see:
If the stones are large:
– Treatment A is effective 73% of the time
– Treatment B is effective 69% of the time
If the stones are small:
– Treatment A is effective 93% of the time
– Treatment B is effective 87% of the time
Overall:
– Treatment A is effective 78% of the time
– Treatment B is effective 83% of the time
Simpson's Paradox: Summary Stats

             A               B
Small        81/87 (93%)     234/270 (87%)
Large        192/263 (73%)   55/80 (69%)
Combined     273/350 (78%)   289/350 (83%)
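The reversal can be checked directly from the counts in the table: treatment A wins within each stone size, yet loses overall, because the severe (large-stone) cases were disproportionately given treatment A:

```python
# Counts from the kidney-stone summary table: (successes, trials).
data = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

for treatment, groups in data.items():
    per_group = {size: s / n for size, (s, n) in groups.items()}
    total_s = sum(s for s, _ in groups.values())
    total_n = sum(n for _, n in groups.values())
    print(treatment, per_group, "combined:", total_s / total_n)

# A is better for small stones (93% vs 87%) and for large stones
# (73% vs 69%), but worse combined (78% vs 83%).
```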
Idea 3
Matching:
– Compare people who saw an ad with similar people who didn't see it but are otherwise "the same."
Problems:
– Hard to define "the same." Beware of lurking variables.
– Simpson's Paradox
Getting at Causation
Randomized, controlled experiments:
– Select a target population
– Randomly decide whom to show the ad
– Subjects cannot influence whether they are in the treatment or control group
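A randomized experiment along these lines can be sketched as follows. The population size, baseline conversion rate, and lift are invented purely for illustration:

```python
import random

random.seed(1)
population = list(range(10_000))   # target population of user IDs
random.shuffle(population)
treatment = population[:5_000]     # randomly chosen half sees the ad
control = population[5_000:]       # the other half does not

def converts(saw_ad):
    # Hypothetical behavior: 5% baseline conversion, +5pp lift from the ad.
    return random.random() < (0.10 if saw_ad else 0.05)

t_rate = sum(converts(True) for _ in treatment) / len(treatment)
c_rate = sum(converts(False) for _ in control) / len(control)
print(f"estimated lift: {t_rate - c_rate:+.3f}")
```

Because assignment is random, user traits (shopping season, ad targeting, browsing habits) are balanced across the two groups in expectation, so the difference in rates estimates the ad's causal effect rather than a correlation.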
Measuring Wear-Out
[Figure: "parallel universe" diagram showing the same population split into control and treatment groups]
Creating Parallel Universes
When a user first arrives:
– Check the browser cookie, assign to the control or treatment group
– Control group: shown a PSA
– Treatment group: shown the ad
– The treatment stays the same on repeated visits
Advertising effects:
– Positive!
– But smaller than reported by observational studies
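One common way to implement the cookie-based split is to hash the cookie ID, so the same user lands in the same group on every visit. This is a generic sketch of that idea, not Yahoo!'s actual implementation:

```python
import hashlib

def assign_group(cookie_id: str, experiment: str = "ad-wearout") -> str:
    # Hash the (experiment, cookie) pair so different experiments get
    # independent splits, and the same user always gets the same group.
    digest = hashlib.sha256(f"{experiment}:{cookie_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 else "control"

# Stable on repeated visits: the same cookie always maps to the same group.
assert assign_group("cookie-abc123") == assign_group("cookie-abc123")
```

Hashing rather than storing the assignment keeps the split stateless: any server can recompute a user's group without a lookup.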
Online Experiments
Advantages:
– Can reach tens of millions of people!
  • Can estimate very small effects: Lewis et al., "Here, There, and Everywhere: Correlated Online Behaviors Can Lead to Overestimates of the Effects of Advertising" (WWW 2011) estimate effects as small as 0.01%!
– Can be relatively cheap (e.g., Mechanical Turk)
– Can recruit diverse subjects
  • Not just "20 students in a large Midwestern university." Try to avoid subjects drawn only from WEIRD societies (Western, Educated, Industrialized, Rich, and Democratic).
WEIRD People
Which line is longer?
– Henrich, Joseph; Heine, Steven J.; Norenzayan, Ara (2010). "The weirdest people in the world?" Working Paper Series des Rates für Sozial- und Wirtschaftsdaten.
Online Experiments
More advantages:
– Access: subjects in other countries, geographically diverse
– Can be quick
Challenges:
– Limited choice in the range of treatments (no MRI studies)
– Do people behave differently offline?
External Validity
A major challenge in all lab experiments:
– Virtual and physical labs alike
– Do findings hold outside the lab?
Enter:
– Natural experiments
Natural Experiments
The experimental condition:
– Is not decided by the experimenter
– But is exogenous: subjects have no influence over which condition they end up in
Case Study: Ad Wear-Out
Back to ad wear-out.
Natural experiment:
– When there were two competing campaigns, the Yahoo! ad server decided which campaign to show at random!
– This was by engineering design: both campaigns got an equal share of pageviews, and random selection is simpler and easier to distribute than a round-robin system.
Experiments:
– Compare the behavior of people who saw the same total number of ads, but a different number from each campaign.
Case Study: Ad Wear-Out: Results
Yes, wear-out exists for some ads:
– Some advertisements see a 5x drop in click-through rate after the first exposure
– These typically have very high initial click-through rates
No, not for others:
– Other ads see no decrease in click-through rate even after ten exposures
– These have lower, but steady, click-through rates
Case Study 2: Yelp
Does a higher Yelp rating lead to higher revenue?
How to do the experiment?
– Observational: no causality.
– Controlled: would require deception.
– Natural?
Case Study 2: Yelp
Natural experiment:
– Yelp rounds ratings to the nearest half star.
– A 4.24 average becomes 4 stars; a 4.26 average becomes 4.5 stars.
Data:
– Raw ratings from Yelp
– Restaurant revenue (from tax records)
Finding: a one-star increase leads to a 5-9% increase in revenue.
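The rounding threshold that powers this natural experiment is easy to state in code: restaurants on either side of a boundary are essentially identical in raw quality, yet display half a star apart.

```python
def display_stars(raw_rating: float) -> float:
    # Round to the nearest half star, as the slide describes.
    return round(raw_rating * 2) / 2

print(display_stars(4.24))  # 4.0
print(display_stars(4.26))  # 4.5
```

Comparing revenue just below vs. just above a boundary is a regression-discontinuity-style design: any revenue difference can be attributed to the displayed stars rather than to underlying quality.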
Case Study 3: Badges
How do badges influence user behavior?
Specifically:
– The "Epic" badge on Stack Overflow.
– Awarded after hitting the daily maximum number of points (through posts, responses, etc.) on 50 distinct days.
Experimental design:
– Within-subject pre-test/post-test (again)
– Look at user behavior before/after receiving the badge
– Average over different users, different timings, and (hopefully) all other factors.
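The per-user pre/post comparison can be sketched as follows; the data layout and the toy numbers are hypothetical:

```python
from datetime import date, timedelta

def pre_post_change(daily_activity, badge_day, window=30):
    """Average daily activity in the `window` days after the badge
    minus the average in the `window` days before it."""
    pre = [daily_activity.get(badge_day - timedelta(days=d), 0)
           for d in range(1, window + 1)]
    post = [daily_activity.get(badge_day + timedelta(days=d), 0)
            for d in range(1, window + 1)]
    return sum(post) / window - sum(pre) / window

# Toy user: 2 actions/day before the badge, 3 actions/day after.
badge = date(2013, 4, 5)
activity = {badge + timedelta(days=d): (3 if d > 0 else 2)
            for d in range(-30, 31) if d != 0}
print(pre_post_change(activity, badge))  # 1.0
```

Averaging this change over many users and many badge dates is what lets the design (hopefully) wash out other time-varying factors.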
Case Study 3: Badges
Results: [figure]
Overall
Experimental design is hard!
– Be extra skeptical in your analyses: there are lots of spurious correlations.
Experiments:
– Natural and controlled experiments are the best way to measure effects.
Observational data:
– Sometimes the best you can do
– Can lead to interesting descriptive insights
– But beware of correlations!