
A/B test with problematic data

Ben Paul

May 20, 2015

Background

• It has previously been shown that user experience on our site is better if users first answer a few questions about their preferences.

• We are testing a new landing page to determine if it will cause more users to answer at least onequestion about their preferences.

• If the new landing page causes any statistically significant increase in conversion rate (percentage of users who complete at least one question), then it will be considered a success.

Hypotheses

• The new landing page will cause a statistically significant increase in conversion rate.

Method

• Randomly assign 50% of users to a control group that will be shown the old landing page and the other 50% of users to a treatment group that will be shown the new landing page.

• Track whether each user answers at least one question or not.

• Run a z-test to determine if the treatment group had a greater conversion rate than the control group, with the conventional cutoff for statistical significance of p < 0.05, two-tailed. (A sketch of the underlying calculation follows this list.)
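For reference, here is a minimal sketch of the pooled two-proportion z-test this plan refers to. The helper function and its argument names are my own; prop.test, used in the analysis below, performs the equivalent chi-squared test with a continuity correction.

z_test <- function(x1, n1, x2, n2) {
  p1 <- x1 / n1                       # control conversion rate
  p2 <- x2 / n2                       # treatment conversion rate
  p  <- (x1 + x2) / (n1 + n2)         # pooled conversion rate
  z  <- (p2 - p1) / sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
  2 * pnorm(-abs(z))                  # two-tailed p-value
}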

Analysis

Set up environment

library("plyr")

library("dplyr", warn.conflicts = FALSE) # I�m aware of the plyr/dplyr conflicts

library("scales")

knitr::opts_chunk$set(comment = NA) # remove hashes in output

Read data

dat <- read.csv("data/takehome.csv")


Clean data

Handle data types Check that data types are appropriate.

summary(dat); str(dat)

    user_id                ts                  ab
 Min.   :2.325e+04   Min.   :1.357e+09   control  : 90815
 1st Qu.:2.488e+09   1st Qu.:1.357e+09   treatment:100333
 Median :4.997e+09   Median :1.357e+09
 Mean   :4.998e+09   Mean   :1.357e+09
 3rd Qu.:7.508e+09   3rd Qu.:1.357e+09
 Max.   :1.000e+10   Max.   :1.357e+09

  landing_page    converted
 new_page:95574   Min.   :0.0000
 old_page:95574   1st Qu.:0.0000
                  Median :0.0000
                  Mean   :0.1011
                  3rd Qu.:0.0000
                  Max.   :1.0000

'data.frame': 191148 obs. of 5 variables:
 $ user_id     : num  9.64e+09 2.46e+09 9.67e+09 2.25e+09 7.81e+09 ...
 $ ts          : num  1.36e+09 1.36e+09 1.36e+09 1.36e+09 1.36e+09 ...
 $ ab          : Factor w/ 2 levels "control","treatment": 2 2 1 2 1 1 1 2 2 1 ...
 $ landing_page: Factor w/ 2 levels "new_page","old_page": 1 1 2 1 2 2 2 1 2 2 ...
 $ converted   : int  0 0 0 0 0 1 1 0 0 0 ...

Data types appear to be appropriate. The independent variables “ab” and “landing_page” each have two levels, corresponding to the control condition (“control”/“old_page”) and the treatment condition (“treatment”/“new_page”).

The dependent variable “converted” is an integer with just two possible values representing whether the user answered at least one question (1) or not (0). Let’s ensure that it has no other values:

unique(dat$converted)

[1] 0 1

The dependent variable has no other values besides 0 and 1, so no cleaning is required.

In summary, there are no problematic data types or values apparent from initial inspection.

Handle duplicates The documentation indicated that each user should be assigned to just one condition, either the control group (ab = “control”), which was shown the old landing page (landing_page = “old_page”), or the treatment group (ab = “treatment”), which was shown the new landing page (landing_page = “new_page”).

Therefore, each user_id should have just one row in the data set, with information about the one condition they were assigned to as well as the one landing page they were shown. If any user has more than one row, something may have gone wrong and we will need to explore the data to determine how to handle it. Let’s start by determining if this is an issue.


# find user_ids with multiple rows
dat$multi_obs <- (duplicated(dat$user_id) | duplicated(dat$user_id, fromLast = TRUE))

# print the number of rows with this issue
dat[dat$multi_obs, ] %>% nrow

[1] 9528

# print the percentage of rows that have this issue
percent((dat[dat$multi_obs, ] %>% nrow) / (dat %>% nrow))

[1] "4.98%"

These calculations show that some users do have multiple rows. These multi-observation users account for 9,528 observations, or 5% of all observations. This is concerning.

To understand this issue more fully, the next step will be to visually inspect a sample of multi-observation users’ data.

# print a sample of multi-observation users' data
dat[dat$multi_obs, ] %>%
  arrange(user_id, ts) %>% # show each user's data chronologically
  head(30) %>%
  mutate(
    # convert timestamps to human readable form
    ts = ts %>% as.POSIXct(origin = "1970-01-01", tz = "GMT")
  )

    user_id                  ts        ab landing_page converted multi_obs
1    203042 2013-01-01 02:56:48 treatment     new_page         0      TRUE
2    203042 2013-01-01 02:56:49 treatment     old_page         1      TRUE
3   2394489 2013-01-01 11:23:54 treatment     new_page         0      TRUE
4   2394489 2013-01-01 11:23:55 treatment     old_page         1      TRUE
5   2695427 2013-01-01 18:37:58 treatment     new_page         0      TRUE
6   2695427 2013-01-01 18:37:59 treatment     old_page         0      TRUE
7   3789396 2013-01-01 01:05:13 treatment     new_page         0      TRUE
8   3789396 2013-01-01 01:05:14 treatment     old_page         0      TRUE
9   6213582 2013-01-01 12:43:13 treatment     new_page         0      TRUE
10  6213582 2013-01-01 12:43:14 treatment     old_page         0      TRUE
11  7647078 2013-01-01 20:04:34 treatment     new_page         0      TRUE
12  7647078 2013-01-01 20:04:35 treatment     old_page         1      TRUE
13 11584819 2013-01-01 12:53:41 treatment     new_page         0      TRUE
14 11584819 2013-01-01 12:53:42 treatment     old_page         0      TRUE
15 11803291 2013-01-01 21:33:00 treatment     new_page         0      TRUE
16 11803291 2013-01-01 21:33:01 treatment     old_page         0      TRUE
17 22522327 2013-01-01 12:45:08 treatment     new_page         0      TRUE
18 22522327 2013-01-01 12:45:09 treatment     old_page         0      TRUE
19 22577434 2013-01-01 06:13:05 treatment     new_page         0      TRUE
20 22577434 2013-01-01 06:13:06 treatment     old_page         0      TRUE
21 24144768 2013-01-01 21:42:04 treatment     new_page         0      TRUE
22 24144768 2013-01-01 21:42:05 treatment     old_page         0      TRUE
23 25758261 2013-01-01 14:52:11 treatment     new_page         0      TRUE
24 25758261 2013-01-01 14:52:12 treatment     old_page         0      TRUE
25 29616796 2013-01-01 02:17:18 treatment     new_page         0      TRUE
26 29616796 2013-01-01 02:17:19 treatment     old_page         0      TRUE
27 32617932 2013-01-01 21:50:20 treatment     new_page         0      TRUE
28 32617932 2013-01-01 21:50:21 treatment     old_page         1      TRUE
29 32786569 2013-01-01 07:48:23 treatment     new_page         0      TRUE
30 32786569 2013-01-01 07:48:24 treatment     old_page         1      TRUE

In this sample of multi-observation users, it appears that such users see the new page first and then land on the old page one second later. Inspection of all multi-observation user data verified this.

Inspection of this sample also raised the question of whether multi-observation users are primarily in the treatment group. Analysis of all multi-observation user data (below) confirmed that 99.9% of multi-observation users were assigned to the treatment group, and therefore should have been shown only the new page. However, what actually happened is that multi-observation users saw the new page for one second before ultimately landing on the old page, which was intended for the control group. This behavior does not match the intended experimental design.

The sample data also suggest that multi-observation users never convert on the new page, which would make sense since it was shown for just one second before they landed on the old page. Analysis of all multi-observation user data (below) confirmed that none of these users converted on the new page.
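A quick way to verify the new-page-then-old-page ordering over the full data set is sketched below (my own check, using only columns already defined above):

# sketch: check that every multi-observation user saw new_page first and
# old_page last when their rows are ordered by timestamp
dat[dat$multi_obs, ] %>%
  arrange(user_id, ts) %>%
  group_by(user_id) %>%
  summarize(saw_new_then_old = first(landing_page) == "new_page" &
              last(landing_page) == "old_page") %>%
  .$saw_new_then_old %>%
  all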

# calculate percentage of multi-observation users assigned only to the treatment group
multi_summary <- dat[dat$multi_obs, ] %>%
  group_by(user_id) %>%
  summarize(all_treatment = as.numeric(all(ab == "treatment"))) # if user's rows are all "treatment" -> 1
percent(sum(multi_summary$all_treatment) / nrow(multi_summary))

[1] "99.9%"

# count number of times multi-observation users converted on the new page
dat[dat$multi_obs, ] %>%
  filter(landing_page == "new_page", converted == 1) %>%
  nrow

[1] 0

The calculations above demonstrate that, as previously discussed, 99.9% of multi-observation users were in the treatment group, but none of them converted from the new landing page.

It would be possible to correct such users’ data by changing their label from “treatment” to “control” and by removing the data from when they loaded the new page for a second. However, their responses may have been influenced by a glitch in the website, which would not be generalizable to the wider audience for which these changes are intended. In addition, they were not exposed to the experimental design as intended. Therefore, their data would be difficult to interpret and should be removed altogether.

Note that the decision to remove their data entirely would be defensible only if multi-observation users represented a random subset of the population under test. If multi-observation users represent a non-random subset (e.g., people who use Internet Explorer), it would not be wise to delete their data, as it would limit the generalizability of the results (e.g., results would then only apply to people who don’t use Internet Explorer). Therefore, if the glitch affected a non-random subset of users, I would advise running more users through the study after fixing the glitch.

For the sake of this assignment, I will assume this is due to a random glitch and we can remove their data.
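For reference, the rejected correction might look like the following sketch (not applied; it assumes the new_page row is always the spurious one-second load):

# hypothetical correction (NOT applied): keep only multi-observation users'
# old_page rows and relabel those users as control, since the old page is
# what they actually experienced
corrected <- dat %>%
  filter(!(multi_obs & landing_page == "new_page")) %>%
  mutate(ab = factor(ifelse(multi_obs, "control", as.character(ab))))

Instead, we simply drop these users’ rows: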


# remove all rows belonging to multi-observation users
dat <- dat[!dat$multi_obs, ]

Check for further experimental errors As previously mentioned, users in the control group should only see the old page, and users in the treatment group should only see the new page.

Therefore, after we removed users with multiple observations, if there are still any users left that saw the wrong page given their condition, we will need to decide how to handle them.

# check that treatment and control groups saw their corresponding pages
table(dat$ab, dat$landing_page)

            new_page old_page
  control          0    90809
  treatment    90811        0

The table indicates that we have fully removed the problematic users; each condition is now associated with the correct landing page.
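To make this check fail loudly in code rather than by eye, one could add an assertion along these lines (a sketch, not part of the original analysis):

# sketch: stop if any user saw the wrong page for their condition
stopifnot(sum(dat$ab == "control" & dat$landing_page == "new_page") == 0,
          sum(dat$ab == "treatment" & dat$landing_page == "old_page") == 0)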

Analyze data

Now that the data has been cleaned, we can conduct a z-test to determine if there was an effect of experimental condition on conversion rate.

tbl <- table(dat$ab, dat$converted)
res <- tbl %>% prop.test # equivalent to a two-proportion z-test
names(res$estimate) <- c("control", "treatment") # make results readable

# invert point estimates to show conversion rate rather than non-conversion rate
# (prop.test treats the table's first column, converted = 0, as the "success")
rates <- (1 - res$estimate)

# confidence interval of the difference in conversion rates between groups
diff.conf.int <- res$conf.int

# to help with interpretation, also calculate a conversion rate confidence interval for each group separately
control.conf.int <- prop.test(tbl["control", "1"], sum(tbl["control", ])) %>%
  .$conf.int
treatment.conf.int <- prop.test(tbl["treatment", "1"], sum(tbl["treatment", ])) %>%
  .$conf.int

Results

Examine results.

control.conf.int %>% round(3) %>% percent

[1] "9.8%" "10.2%"

treatment.conf.int %>% round(3) %>% percent

[1] "10.5%" "10.9%"

rates %>% round(3) %>% sapply(percent)

  control treatment
    "10%"   "10.7%"

diff.conf.int %>% round(3) %>% percent

[1] "0.3%" "0.9%"

res["p.value"]

$p.value
[1] 1.104298e-05

The conversion rate of the old page is 10.0% (95% confidence interval, 9.8% - 10.2%). The conversion rate of the new page is 10.7% (95% confidence interval, 10.5% - 10.9%). The new page has a higher conversion rate than the old page (95% confidence interval of difference, 0.3% - 0.9%), p < 0.001.

If the decision to remove the problematic users was correct, then we can say with 95% confidence that the new page’s conversion rate is 3 - 9% greater than the old page’s conversion rate. (The absolute difference of 0.3 - 0.9 percentage points, divided by the roughly 10% baseline rate of the old page, gives this 3 - 9% relative increase.)
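That relative figure can be recovered from the quantities computed above; a sketch (my own, dividing the absolute difference by the control point estimate):

# sketch: relative lift implied by the absolute difference
(diff.conf.int / rates[["control"]]) %>% round(2) %>% percent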

Discussion

Given the higher conversion rate of the new landing page, I would recommend we switch all users over to it and monitor whether the conversion rate increases as expected.

Regarding the discrepancy between our data and the third party’s data, I believe our data is more accurate because we have cleaned problematic observations from it. There is no reason to believe that the third party cleaned the data, although I would contact them to confirm this. I would explain the discrepancy to the project manager by stating that some people were mislabeled as having seen the new page, when really they saw the old page. Acme’s system isn’t set up to catch these problems, but as a result of her request we were able to find and delete the bad data, uncovering the significant results that she suspected were there all along.

To protect future experiments, it would be important to understand why these glitches occurred. Therefore, I would discuss the issue with developers and quality assurance analysts and try to reproduce the problematic behavior. If I’m not able to, I would offer an incentive to anyone in the company who could. (This strategy has been successful for me in my current company: employees will actually race to reproduce an issue to earn a gold star.) Once the conditions for reproduction are identified, we can determine how to prevent this glitch in the future.

I would also suggest we set up monitoring in similar experiments to ensure that these problematic conditions don’t occur again. In particular, (a) each user should have just one observation, and (b) each experimental condition should be associated with the expected behavior (e.g., the treatment condition should be associated with only the new page and the control condition should be associated with only the old page). A first step would be to set up a daily email indicating whether (a) and (b) are satisfied; a sketch of such a check appears below. As we grow more confident in the system, we could have it email us only if (a) and (b) are not satisfied.

Whenever problems arise, we should analyze what went wrong, explore whether we need to delete or correct the relevant data, and continue to implement more safeguards to prevent similar problems in the future.
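As a sketch of what such a daily check could compute (a hypothetical helper written for this data set’s column names, not an existing system):

# hypothetical monitoring check: returns TRUE for each invariant that holds
check_ab_health <- function(dat) {
  c(
    one_row_per_user = !any(duplicated(dat$user_id)),              # (a)
    pages_match_condition = all(
      (dat$ab == "treatment" & dat$landing_page == "new_page") |
      (dat$ab == "control" & dat$landing_page == "old_page")
    )                                                              # (b)
  )
}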
