A/B test with problematic data
Ben Paul
May 20, 2015
Background
• It has previously been shown that user experience on our site is better if users first answer a few questions about their preferences.
• We are testing a new landing page to determine if it will cause more users to answer at least one question about their preferences.
• If the new landing page causes any statistically significant increase in conversion rate (percentage of users who complete at least one question), then it will be considered a success.
Hypotheses
• The new landing page will cause a statistically significant increase in conversion rate.
Method
• Randomly assign 50% of users to a control group that will be shown the old landing page and the other 50% of users to a treatment group that will be shown the new landing page.
• Track whether each user answers at least one question or not.
• Run a z-test to determine if the treatment group had a greater conversion rate than the control group, with the conventional cutoff for statistical significance of p < 0.05, two-tailed.
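As a minimal sketch of the test named in the last step, the two-proportion z statistic can be computed by hand and compared with R's prop.test. The counts below are hypothetical and chosen only for illustration; they are not the study data. Note that prop.test applies a continuity correction by default, so the comparison here turns it off with correct = FALSE.

```r
# Hypothetical counts, for illustration only (not the study data)
conversions <- c(control = 100, treatment = 130)
users       <- c(control = 1000, treatment = 1000)

# Observed conversion rate in each group
p_hat <- conversions / users

# Pooled rate under the null hypothesis of no difference
p_pool <- sum(conversions) / sum(users)

# Standard error of the difference under the null hypothesis
se <- sqrt(p_pool * (1 - p_pool) * sum(1 / users))

# z statistic and two-tailed p-value
z <- unname((p_hat["treatment"] - p_hat["control"]) / se)
p_two_tailed <- 2 * pnorm(-abs(z))

# The uncorrected chi-squared test from prop.test is the same test:
# its statistic equals z^2 and its p-value matches the two-tailed one
chi <- prop.test(conversions, users, correct = FALSE)
```

With the default correct = TRUE, the p-value reported by prop.test would be slightly larger than the uncorrected z-test value.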
Analysis
Set up environment
library("plyr")
library("dplyr", warn.conflicts = FALSE) # I'm aware of the plyr/dplyr conflicts
library("scales")
knitr::opts_chunk$set(comment = NA) # remove hashes in output
Read data
dat <- read.csv("data/takehome.csv")
Clean data
Handle data types
Check that data types are appropriate.
summary(dat); str(dat);
user_id ts ab
Min. :2.325e+04 Min. :1.357e+09 control : 90815
1st Qu.:2.488e+09 1st Qu.:1.357e+09 treatment:100333
Median :4.997e+09 Median :1.357e+09
Mean :4.998e+09 Mean :1.357e+09
3rd Qu.:7.508e+09 3rd Qu.:1.357e+09
Max. :1.000e+10 Max. :1.357e+09
landing_page converted
new_page:95574 Min. :0.0000
old_page:95574 1st Qu.:0.0000
Median :0.0000
Mean :0.1011
3rd Qu.:0.0000
Max. :1.0000
'data.frame': 191148 obs. of 5 variables:
$ user_id : num 9.64e+09 2.46e+09 9.67e+09 2.25e+09 7.81e+09 ...
$ ts : num 1.36e+09 1.36e+09 1.36e+09 1.36e+09 1.36e+09 ...
$ ab : Factor w/ 2 levels "control","treatment": 2 2 1 2 1 1 1 2 2 1 ...
$ landing_page: Factor w/ 2 levels "new_page","old_page": 1 1 2 1 2 2 2 1 2 2 ...
$ converted : int 0 0 0 0 0 1 1 0 0 0 ...
Data types appear to be appropriate. The independent variables “ab” and “landing_page” each have two levels, corresponding to the control condition (“control”/“old_page”) and the treatment condition (“treatment”/“new_page”).
The dependent variable “converted” is an integer with just two possible values representing whether the user answered at least one question (1) or not (0). Let’s ensure that it has no other values:
unique(dat$converted)
[1] 0 1
The dependent variable has no other values besides 0 and 1, so no cleaning is required.
In summary, there are no problematic data types or values apparent from initial inspection.
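The visual checks above could also be written as explicit assertions, so that a future data pull fails loudly if a type or value is unexpected. This is a sketch on a toy data frame with hypothetical values, not the study data; the missing-value check is an extra assumption that the original inspection did not state.

```r
# Toy stand-in for the study data (hypothetical values)
toy <- data.frame(
  ab           = factor(c("control", "treatment", "control")),
  landing_page = factor(c("old_page", "new_page", "old_page")),
  converted    = c(0L, 1L, 0L)
)

checks <- c(
  # each independent variable has exactly the two expected levels
  ab_levels        = setequal(levels(toy$ab), c("control", "treatment")),
  page_levels      = setequal(levels(toy$landing_page), c("new_page", "old_page")),
  # the dependent variable only takes the values 0 and 1
  converted_binary = all(toy$converted %in% c(0L, 1L)),
  # extra check (an assumption, not in the original): no missing values
  no_missing       = !any(is.na(toy))
)

# Stops with an error if any check is FALSE
stopifnot(checks)
```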
Handle duplicates
The documentation indicated that each user should be assigned to just one condition, either the control group (ab = “control”), which was shown the old landing page (landing_page = “old_page”), or the treatment group (ab = “treatment”), which was shown the new landing page (landing_page = “new_page”).
Therefore, each user_id should have just one row in the data set, with information about the one condition they were assigned as well as the one landing page they were shown. If any user has more than one row, something may have gone wrong and we will need to explore the data to determine how to handle it. Let’s start by determining if this is an issue.
# find user_ids with multiple rows
dat$multi_obs <- (duplicated(dat$user_id) | duplicated(dat$user_id, fromLast = TRUE))
# print the number of rows with this issue
dat[dat$multi_obs, ] %>% nrow
[1] 9528
# print the percentage of rows that have this issue
percent((dat[dat$multi_obs, ] %>% nrow) / (dat %>% nrow))
[1] "4.98%"
These calculations show that some users do have multiple rows. These multi-observation users account for 9,528 observations, or 5% of all observations. This is concerning.
To understand this issue more fully, the next step will be to visually inspect a sample of multi-observation users’ data.
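The flag used above combines duplicated() in both directions because a single call marks only the second and later occurrences of a value. The toy vector below (hypothetical IDs) shows the difference, along with an equivalent formulation.

```r
# Hypothetical user IDs; 10 and 20 each appear twice
user_id <- c(10, 20, 10, 30, 20, 40)

# Forward scan flags only the later occurrence of each repeated ID
duplicated(user_id)                   # FALSE FALSE  TRUE FALSE  TRUE FALSE

# Backward scan flags only the earlier occurrence
duplicated(user_id, fromLast = TRUE)  #  TRUE  TRUE FALSE FALSE FALSE FALSE

# Combining both flags every row that belongs to a repeated ID
multi_obs <- duplicated(user_id) | duplicated(user_id, fromLast = TRUE)

# Equivalent: flag rows whose ID appears among the duplicated IDs
multi_obs_alt <- user_id %in% user_id[duplicated(user_id)]
```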
# print a sample of multi-observation users' data
dat[dat$multi_obs, ] %>%
  arrange(user_id, ts) %>% # show each user's data chronologically
  head(30) %>%
  mutate(
    # convert timestamps to human readable form
    ts = ts %>% as.POSIXct(origin = "1970-01-01", tz = "GMT")
  )
user_id ts ab landing_page converted multi_obs
1 203042 2013-01-01 02:56:48 treatment new_page 0 TRUE
2 203042 2013-01-01 02:56:49 treatment old_page 1 TRUE
3 2394489 2013-01-01 11:23:54 treatment new_page 0 TRUE
4 2394489 2013-01-01 11:23:55 treatment old_page 1 TRUE
5 2695427 2013-01-01 18:37:58 treatment new_page 0 TRUE
6 2695427 2013-01-01 18:37:59 treatment old_page 0 TRUE
7 3789396 2013-01-01 01:05:13 treatment new_page 0 TRUE
8 3789396 2013-01-01 01:05:14 treatment old_page 0 TRUE
9 6213582 2013-01-01 12:43:13 treatment new_page 0 TRUE
10 6213582 2013-01-01 12:43:14 treatment old_page 0 TRUE
11 7647078 2013-01-01 20:04:34 treatment new_page 0 TRUE
12 7647078 2013-01-01 20:04:35 treatment old_page 1 TRUE
13 11584819 2013-01-01 12:53:41 treatment new_page 0 TRUE
14 11584819 2013-01-01 12:53:42 treatment old_page 0 TRUE
15 11803291 2013-01-01 21:33:00 treatment new_page 0 TRUE
16 11803291 2013-01-01 21:33:01 treatment old_page 0 TRUE
17 22522327 2013-01-01 12:45:08 treatment new_page 0 TRUE
18 22522327 2013-01-01 12:45:09 treatment old_page 0 TRUE
19 22577434 2013-01-01 06:13:05 treatment new_page 0 TRUE
20 22577434 2013-01-01 06:13:06 treatment old_page 0 TRUE
21 24144768 2013-01-01 21:42:04 treatment new_page 0 TRUE
22 24144768 2013-01-01 21:42:05 treatment old_page 0 TRUE
23 25758261 2013-01-01 14:52:11 treatment new_page 0 TRUE
24 25758261 2013-01-01 14:52:12 treatment old_page 0 TRUE
25 29616796 2013-01-01 02:17:18 treatment new_page 0 TRUE
26 29616796 2013-01-01 02:17:19 treatment old_page 0 TRUE
27 32617932 2013-01-01 21:50:20 treatment new_page 0 TRUE
28 32617932 2013-01-01 21:50:21 treatment old_page 1 TRUE
29 32786569 2013-01-01 07:48:23 treatment new_page 0 TRUE
30 32786569 2013-01-01 07:48:24 treatment old_page 1 TRUE
In this sample of multi-observation users, it appears that such users see the new page first and then land on the old page one second later. Inspection of all multi-observation user data verified this.
Inspection of this sample also raised the question of whether multi-observation users are primarily in the treatment group. Analysis of all multi-observation user data (below) confirmed that 99.9% of multi-observation users were assigned to the treatment group, and therefore should have been shown only the new page. However, what actually happened is that multi-observation users saw the new page for one second before ultimately landing on the old page, which was intended for the control group. This behavior does not match the intended experimental design.
The sample data also suggest that multi-observation users never convert on the new page, which would make sense since it was shown for just one second before they landed on the old page. Analysis of all multi-observation user data (below) confirmed that none of these users converted on the new page.
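The report does not show the code used to verify the "new page first, old page one second later" pattern across all multi-observation users. One way such a check could be written is sketched below in base R, on toy data with hypothetical IDs and timestamps; the third user is deliberately constructed to break the pattern.

```r
# Toy multi-observation data (hypothetical IDs and timestamps)
multi <- data.frame(
  user_id      = c(1, 1, 2, 2, 3, 3),
  ts           = c(100, 101, 200, 201, 300, 305),
  landing_page = c("new_page", "old_page", "new_page", "old_page",
                   "new_page", "old_page"),
  stringsAsFactors = FALSE
)

# Put each user's rows in chronological order
multi <- multi[order(multi$user_id, multi$ts), ]

# For each user: exactly two rows, new page then old page, one second apart
follows_pattern <- sapply(split(multi, multi$user_id), function(rows) {
  nrow(rows) == 2 &&
    identical(as.character(rows$landing_page), c("new_page", "old_page")) &&
    diff(rows$ts) == 1
})

# TRUE only if every user matches; here user 3 has a five-second gap
all(follows_pattern)
```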
# calculate percentage of multi-observation users assigned only to the treatment group
multi_summary <- dat[dat$multi_obs, ] %>%
  group_by(user_id) %>%
  summarize(all_treatment = as.numeric(all(ab == "treatment"))) # if user's rows are all "treatment" -> 1
percent(sum(multi_summary$all_treatment) / nrow(multi_summary))
[1] "99.9%"
# count number of times multi-observation users converted on the new page
dat[dat$multi_obs, ] %>%
  filter(landing_page == "new_page", converted == 1) %>%
  nrow
[1] 0
The calculations above demonstrate that, as previously discussed, 99.9% of multi-observation users were in the treatment group, but none of them converted from the new landing page.
It would be possible to correct such users’ data by changing their label from “treatment” to “control” and by removing the data from when they loaded the new page for a second. However, their responses may have been influenced by a glitch in the website, which would not be generalizable to the wider audience for which these changes are intended. In addition, they were not exposed to the experimental design as intended. Therefore, their data would be difficult to interpret and should be removed altogether.
Note that the decision to remove their data entirely would be defensible only if multi-observation users represented a random subset of the population under test. If multi-observation users represent a non-random subset (e.g., people who use Internet Explorer), it would not be wise to delete their data, as it would limit the generalizability of the results (e.g., results would then only apply to people who don’t use Internet Explorer). Therefore, if the glitch affected a non-random subset of users, I would advise running more users through the study after fixing the glitch.
For the sake of this assignment, I will assume this is due to a random glitch and we can remove their data.
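To make the two options concrete, the sketch below applies both to a toy data frame with hypothetical values: removing multi-observation users entirely (the approach taken in this analysis), and the alternative of keeping their old-page row and relabeling them as control. The second option is shown only for comparison.

```r
# Toy data (hypothetical): user 1 hit the glitch and has two rows
toy <- data.frame(
  user_id      = c(1, 1, 2, 3),
  ab           = c("treatment", "treatment", "treatment", "control"),
  landing_page = c("new_page", "old_page", "new_page", "old_page"),
  converted    = c(0, 1, 0, 1),
  stringsAsFactors = FALSE
)

multi_obs <- duplicated(toy$user_id) | duplicated(toy$user_id, fromLast = TRUE)

# Option A (used in this analysis): drop multi-observation users entirely
removed <- toy[!multi_obs, ]

# Option B (alternative, not used): drop only the one-second new-page row,
# then relabel the affected users as control
corrected <- toy[!(multi_obs & toy$landing_page == "new_page"), ]
affected  <- corrected$user_id %in% toy$user_id[multi_obs]
corrected$ab[affected] <- "control"
```

Running the z-test on both versions of the real data would show how sensitive the conclusion is to this cleaning decision.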
dat <- dat[!dat$multi_obs, ]
Check for further experimental errors
As previously mentioned, users in the control group should only see the old page, and users in the treatment group should only see the new page.
Therefore, after we removed users with multiple observations, if there are still any users left that saw the wrong page given their condition, we will need to decide how to handle them.
# check that treatment and control groups saw their corresponding pages
table(dat$ab, dat$landing_page)
new_page old_page
control 0 90809
treatment 90811 0
The table indicates that we have fully removed the problematic users; each condition is now associated with the correct landing page.
Analyze data
Now that the data has been cleaned, we can conduct a z-test to determine if there was an effect of experimental condition on conversion rate.
tbl <- table(dat$ab, dat$converted)
res <- tbl %>% prop.test # aka z-test
names(res$estimate) <- c("control", "treatment") # make results readable
# invert point estimates to show conversion rate rather than non-conversion rate
rates <- (1 - res$estimate)
# extract confidence interval of the difference (formatted as a percentage in Results)
diff.conf.int <- res$conf.int
# to help with interpretation, also calculate conversion rate confidence interval for each group separately
control.conf.int <- prop.test(tbl["control", "1"], sum(tbl["control", ])) %>%
  .$conf.int
treatment.conf.int <- prop.test(tbl["treatment", "1"], sum(tbl["treatment", ])) %>%
  .$conf.int
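The inversion step in the code above is needed because prop.test, when given a two-column table, treats the first column as the count of successes. Since the first column here is converted = 0, the raw estimates are non-conversion rates. The sketch below shows this on a toy table with hypothetical counts.

```r
# Toy 2x2 table (hypothetical counts): rows are groups,
# columns are converted = 0 and converted = 1
toy_tbl <- matrix(
  c(900, 100,
    870, 130),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("control", "treatment"), c("0", "1"))
)

toy_res <- prop.test(toy_tbl)

# Estimates are proportions of the FIRST column, i.e. non-conversion rates
non_conversion <- unname(toy_res$estimate)

# Subtracting from 1 recovers the conversion rates
conversion <- 1 - non_conversion
```

For the same reason, the confidence interval of the difference is (non-conversion in control) minus (non-conversion in treatment), which equals the treatment conversion rate minus the control conversion rate.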
Results
Examine results.
control.conf.int %>% round(3) %>% percent
[1] "9.8%" "10.2%"
treatment.conf.int %>% round(3) %>% percent
[1] "10.5%" "10.9%"
rates %>% round(3) %>% sapply(percent)
control treatment
"10%" "10.7%"
diff.conf.int %>% round(3) %>% percent
[1] "0.3%" "0.9%"
res["p.value"]
$p.value
[1] 1.104298e-05
The conversion rate of the old page is 10.0% (95% confidence interval, 9.8% - 10.2%). The conversion rate of the new page is 10.7% (95% confidence interval, 10.5% - 10.9%). The new page has a higher conversion rate than the old page (95% confidence interval of difference, 0.3% - 0.9%), p < 0.001.
If the decision to remove the problematic users was correct, then we can say with 95% confidence that the new page’s conversion rate is 0.3 to 0.9 percentage points greater than the old page’s conversion rate, which is roughly 3% to 9% greater in relative terms.
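The confidence interval of the difference is expressed in percentage points, whereas a figure of about 3% to 9% describes the lift relative to the old page's rate. A small sketch of that arithmetic, using the rounded figures reported above, follows. Dividing the interval by the control point estimate ignores the uncertainty in the control rate itself, so the relative range is an approximation.

```r
# Rounded figures taken from the results above
control_rate <- 0.100
diff_ci      <- c(lower = 0.003, upper = 0.009)  # absolute difference

# Approximate relative lift: absolute difference divided by the control rate
relative_ci <- diff_ci / control_rate
round(relative_ci, 2)
```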
Discussion
Given the higher conversion rate of the new landing page, I would recommend we switch all users over to it and monitor whether the conversion rate increases as expected.
Regarding the discrepancy between our data and the third party’s data, I believe our data is more accurate because we have cleaned problematic observations from it. There is no reason to believe that the third party cleaned the data, although I would contact them to confirm this.
I would explain the discrepancy to the project manager by stating that some people were mislabeled as having seen the new page, when really they saw the old page. Acme’s system isn’t set up to catch these problems, but as a result of her request we were able to find and delete the bad data, uncovering the significant results that she suspected were there all along.
To protect future experiments, it would be important to understand why these glitches occurred. Therefore, I would discuss the issue with developers and quality assurance analysts and try to reproduce the problematic behavior. If I’m not able to, I would offer an incentive to anyone in the company who could. (This strategy has been successful for me in my current company: employees will actually race to reproduce an issue to earn a gold star.) Once the conditions for reproduction are identified, we can determine how to prevent this glitch in the future.
I would also suggest we set up monitoring in similar experiments to ensure that these problematic conditions don’t occur again. In particular, (a) each user should have just one observation, and (b) each experimental condition should be associated with the expected behavior (e.g., the treatment condition should be associated with only the new page and the control condition should be associated with only the old page). A first step would be to set up a daily email indicating whether (a) and (b) are satisfied. As we grow more confident in the system, we could have it email us only if (a) and (b) are not satisfied.
Whenever problems arise, we should analyze what went wrong, explore whether we need to delete or correct the relevant data, and continue to implement more safeguards to prevent similar problems in the future.
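As a starting point for the monitoring described above, conditions (a) and (b) could be expressed as a small function whose result feeds the daily email. This is a sketch only: the function name check_experiment is hypothetical, the column names follow this data set, and the toy data frames use made-up values.

```r
# Hypothetical helper: returns TRUE/FALSE for conditions (a) and (b)
check_experiment <- function(dat) {
  # (a) each user has exactly one observation
  one_row_per_user <- !any(duplicated(dat$user_id))

  # (b) each condition is paired with its expected landing page
  expected_page <- ifelse(dat$ab == "treatment", "new_page", "old_page")
  pages_match   <- all(as.character(dat$landing_page) == expected_page)

  c(one_row_per_user = one_row_per_user, pages_match = pages_match)
}

# Toy data (hypothetical values)
clean <- data.frame(
  user_id      = c(1, 2, 3),
  ab           = c("control", "treatment", "control"),
  landing_page = c("old_page", "new_page", "old_page"),
  stringsAsFactors = FALSE
)

# Same data plus a glitch row: user 2 appears again on the old page
glitchy <- rbind(
  clean,
  data.frame(user_id = 2, ab = "treatment", landing_page = "old_page",
             stringsAsFactors = FALSE)
)

check_experiment(clean)    # both conditions hold
check_experiment(glitchy)  # both conditions fail
```

Returning a named logical vector keeps the report readable and makes it simple to switch later to alerting only when any element is FALSE.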