Case-control studies, regression and survival analysis
Tyler Moore
CSE 7338, Computer Science & Engineering Department, SMU, Dallas, TX
Lectures 6–7
Outline
1 Case-control studies
2 Regression and survival analysis
Case-control studies
Guide to exploring data
Type of data                 Exploration (plot)                                  Statistics                           RByEx
1 numerical variable         ECDF, e.g. ecdf(br$logbreach); dot plot of          one-way t-test, Wilcoxon test        6.3
                             log(#records breached)
1 categorical variable       bar plot of counts per category                     –                                    3.1
                             (e.g. CARD, HACK, PHYS, STAT)
  # categories = 2                                                               prop.test                            6.2
1 categorical, 1 numerical   box plots of log(#records breached) by              anova, Permutation                   10
                             organization type (e.g. BSF, EDU) and by a
                             TRUE/FALSE breach-type indicator
  # categories = 2                                                               2-way t-test, Wilcoxon test, Perm.   6.4
2 categorical variables      plot of organization type (BSF, BSO, BSR, EDU,      χ2 test                              3.2–3.5
                             GOV, MED, NGO) against breach type (CARD, DISC,
                             HACK, INSD, PHYS, PORT, STAT, UNKN)
Guide to analyzing data
After visual exploration and any descriptive statistics, you may want to investigate relationships between variables more closely
In particular, you can investigate how one or more explanatory (aka independent) variables influence response (aka dependent) variables
Statistical Method    Response Variable       Explanatory Variable
Odds ratios           Binary (case/control)   Categorical variables (1 at a time)
Linear regression     Numerical               One or more variables (numerical or categorical)
Logistic regression   Binary                  One or more variables (numerical or categorical)
Survival analysis     Time to event           One or more variables (numerical or categorical)
Identifying risk factors in epidemiology
Case-control studies and cybercrime
In a perfect world, we could measure security using randomized controlled experiments similar to medicine
But most security data is observational – we can’t select subjects and apply treatments to a subset
Instead, we can observe that some targets are victimized, while other vulnerable targets are not
Crucially, this observation happens after the fact (if at all)
The case-control study method is ideal for identifying risk factors when all you have is observational data
Case-control study design
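Population
  Case (identified in the present)
    Exposed / Not Exposed (determined from the past)
  Control (identified in the present)
    Exposed / Not Exposed (determined from the past)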
Case-control study design: smoking and lung cancer
Population: Doctors
  Case: Lung Cancer (present)
    Exposed: Smoker / Not Exposed: Non-smoker (past)
  Control: No Lung Cancer (present)
    Exposed: Smoker / Not Exposed: Non-smoker (past)
The odds ratio
                               Case (afflicted)   Control (not afflicted)
Exposed (has risk factor)      p11                p10
Not exposed (no risk factor)   p01                p00

$\text{odds ratio} = \dfrac{p_{11} \, p_{00}}{p_{10} \, p_{01}}$
A word on odds ratios
Defining odds
Suppose we have an event with two possible outcomes: success ($S$) and failure ($\bar{S}$)
The two outcomes occur with probabilities $p_S$ and $p_{\bar{S}} = 1 - p_S$
The odds of the event are given by $\dfrac{p_S}{1 - p_S}$
Defining odds ratios
Suppose now there are two events A and B, both of which can occur (with probabilities $p_A$ and $p_B$)
$\text{odds ratio} = \dfrac{\text{odds}(A)}{\text{odds}(B)} = \dfrac{p_A / (1 - p_A)}{p_B / (1 - p_B)} = \dfrac{p_A \times (1 - p_B)}{(1 - p_A) \times p_B}$
Odds ratio example
Adapted from http://www.ats.ucla.edu/stat/stata/faq/oratio.htm
Suppose that 7 of 10 male applicants to engineering school are admitted, compared to 4 of 10 female applicants
$p_{\text{male acc.}} = 0.7$, $p_{\text{male rej.}} = 1 - 0.7 = 0.3$
$p_{\text{female acc.}} = 0.4$, $p_{\text{female rej.}} = 1 - 0.4 = 0.6$
$\text{odds}(\text{male acc.}) = 0.7 / 0.3 = 2.33$
$\text{odds}(\text{female acc.}) = 0.4 / 0.6 = 0.667$
$\text{OR} = 2.33 / 0.667 = 3.5$
Hence, we can say that the odds of a male applicant being admitted are 3.5 times the odds for a female applicant.
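The admissions arithmetic can be checked in a few lines of R. This is only a minimal sketch: the helper `odds()` and the variable names are illustrative and do not come from the course code.

```r
# Acceptance probabilities from the admissions example
p_male   <- 7 / 10
p_female <- 4 / 10

# Odds of an event with probability p: p / (1 - p)
odds <- function(p) p / (1 - p)

odds_male   <- odds(p_male)    # about 2.33
odds_female <- odds(p_female)  # about 0.667

# Odds ratio of male to female acceptance
or_admit <- odds_male / odds_female
print(or_admit)                # about 3.5
```

Computing the ratio from the unrounded odds avoids the small rounding error that 2.33 / 0.667 would introduce.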
Case-control study: spear phishing and academic specialty
Population: Malware spam recipients
  Case: Targeted email (present)
    Exposed: Academic Subject / Not Exposed: Other Subjects (past)
  Control: Untargeted email (present)
    Exposed: Academic Subject / Not Exposed: Other Subjects (past)
Odds ratios for academic subjects in spear phishing study
Illicit online pharmacies
What do illicit online pharmacies have to do with phishing?
Both make use of a similar criminal supply chain
  1 Traffic: hijack web search results (or send email spam)
  2 Host: compromise a high-ranking server to redirect to pharmacy
  3 Hook: affiliate programs let criminals set up website front-ends to sell drugs
  4 Monetize: sell drugs ordered by consumers
  5 Cash out: no need to hire mules, just take credit cards!
For more: http://lyle.smu.edu/~tylerm/usenix11.pdf
Case-control study: search-redirection attacks
Population: pharma search results
  Case: Search-redirection attack (present)
    Exposed: .EDU TLDs / Not Exposed: Other TLDs (past)
  Control: No redirection (present)
    Exposed: .EDU TLDs / Not Exposed: Other TLDs (past)
Case-control study: search-redirection attacks
R code: http://lyle.smu.edu/~tylerm/courses/econsec/code/pharmaOdds.R
Data format:
Date        Search Engine  Search Term            Pos.  URL  Domain  Redirects?  TLD
2011-11-03  Google         20 mg ambien overdose  1     http://products.sanofi.us/ambien/ambien.pdf  sanofi.us  False  other
2011-11-03  Google         20 mg ambien overdose  2     http://swift.sonoma.edu/education/newton/newtonsLaws/?20-mg-ambien-overdose  sonoma.edu  False  .EDU
2011-11-03  Google         20 mg ambien overdose  3     http://ambienoverdose.org/about-2/  ambienoverdose.org  False  .ORG
2011-11-03  Google         20 mg ambien overdose  4     http://answers.yahoo.com/question/index?qid=20090712025803AA10g8Z  yahoo.com  False  .COM
2011-11-03  Google         20 mg ambien overdose  5     http://en.wikipedia.org/wiki/Zolpidem  wikipedia.org  False  .ORG
2011-11-03  Google         20 mg ambien overdose  6     http://blocsonic.com/blog  blocsonic.com  False  .COM
2011-11-03  Google         20 mg ambien overdose  7     http://dinarvets.com/forums/index.php?/user/39154-ambien-side-effects/page  dinarvets.com  False  .COM
2011-11-03  Google         20 mg ambien overdose  8     http://nemo.mwd.hartford.edu/mwd08/images/?20-mg-ambien-overdose  hartford.edu  True  .EDU
2011-11-03  Google         20 mg ambien overdose  9     http://www.formspring.me/AmbienCheapOn  formspring.me  False  other
2011-11-03  Google         20 mg ambien overdose  11    http://www.drugs.com/pro/zolpidem.html  drugs.com  False  .COM
2011-11-03  Google         20 mg ambien overdose  12    http://www.engineer.tamuk.edu/departments/ieen/images/ambien.html  tamuk.edu  False  .EDU
2011-11-03  Bing           20 mg ambien overdose  1     http://answers.yahoo.com/question/index?qid=20090712025803AA10g8Z  yahoo.com  False  .COM
2011-11-03  Bing           20 mg ambien overdose  2     http://www.healthcentral.com/sleep-disorders/h/20-mg-ambien-overdose.html  healthcentral.com  False  .COM
2011-11-03  Bing           20 mg ambien overdose  3     http://ambien20mg.com/  ambien20mg.com  False  .COM
2011-11-03  bing           20 mg ambien overdose  4     http://www.chacha.com/question/will-20-mg-of-ambien-cr-get-you-high  chacha.com  True  .COM
2011-11-03  bing           20 mg ambien overdose  5     http://www.rxlist.com/ambien-drug.htm  rxlist.com  True  .COM
2011-11-03  Bing           20 mg ambien overdose  6     http://www.drugs.com/pro/zolpidem.html  drugs.com  False  .COM
2011-11-03  Bing           20 mg ambien overdose  7     http://answers.yahoo.com/question/index?qid=20111024222432AARFvPB  yahoo.com  False  .COM
2011-11-03  Bing           20 mg ambien overdose  8     http://en.wikipedia.org/wiki/Zolpidem  wikipedia.org  False  .ORG
2011-11-03  Bing           20 mg ambien overdose  9     http://www.thefullwiki.org/Sertraline  thefullwiki.org  False  .ORG
2011-11-03  bing           20 mg ambien overdose  10    http://www.rxlist.com/edluar-drug.htm  rxlist.com  True  .COM
2011-11-03  Bing           20 mg ambien overdose  11    http://www.formspring.me/ambienpill  formspring.me  False  other
2011-11-03  Bing           20 mg ambien overdose  12    http://ambiendosage.net/  ambiendosage.net  False  .NET
Odds ratios for case-control study
> library(epitools)
> pr.tldodds<-oddsratio(pr$tld,pr$redirects,verbose=T)
> pr.tldodds$measure
odds ratio with 95% C.I.
Predictor estimate lower upper
.COM 1.0000000 NA NA
.EDU 5.8390966 5.5363269 6.1591917
.GOV 0.4311855 0.3064817 0.5882604
.NET 0.5946029 0.5568593 0.6342355
.ORG 2.8811488 2.7971838 2.9674615
other 1.3437113 1.2809207 1.4090669
Odds ratios for case-control study
> pr.tldodds$p.value
two-sided
Predictor midp.exact
.COM NA
.EDU 0.000000000000000
.GOV 0.000000009212499
.NET 0.000000000000000
.ORG 0.000000000000000
other 0.000000000000000
two-sided
Predictor fisher.exact
.COM NA
.EDU 0.00000000000000000000000000000000000000000000000000000000000000000000
.GOV 0.00000001116730951558381248266507181077233923360836342908442020416260
.NET 0.00000000000000000000000000000000000000000000000000000000000003109266
.ORG 0.00000000000000000000000000000000000000000000000000000000000000000000
other 0.00000000000000000000000000000002254063153941187904769660716762484880
two-sided
Predictor chi.square
.COM NA
.EDU 0.000000000000000000000000000000000000000000000000000000000000
.GOV 0.000000150899123313924415716095442548116967174109959159977734
.NET 0.000000000000000000000000000000000000000000000000000000017562
.ORG 0.000000000000000000000000000000000000000000000000000000000000
other 0.000000000000000000000000000000000390896706121527347442976835
How to interpret the odds ratios?
> library(epitools)
> pr.tldodds<-oddsratio(pr$tld,pr$redirects,verbose=T)
> pr.tldodds$measure
odds ratio with 95% C.I.
Predictor estimate lower upper
.COM 1.0000000 NA NA
.EDU 5.8390966 5.5363269 6.1591917
.GOV 0.4311855 0.3064817 0.5882604
.NET 0.5946029 0.5568593 0.6342355
.ORG 2.8811488 2.7971838 2.9674615
other 1.3437113 1.2809207 1.4090669
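One hedged way to read this table: .COM is the reference category (its odds ratio is fixed at 1), so each estimate compares that TLD's odds of redirecting with the odds for .COM. An interval lying entirely above 1 suggests higher odds than .COM, and one lying entirely below 1 suggests lower odds. The sketch below re-enters the printed values by hand; the names `or_tab` and `direction` are illustrative and are not part of epitools.

```r
# Odds ratios and 95% confidence limits copied from the output above
# (.COM is the reference category, so it has no interval).
or_tab <- data.frame(
  tld      = c(".EDU", ".GOV", ".NET", ".ORG", "other"),
  estimate = c(5.8390966, 0.4311855, 0.5946029, 2.8811488, 1.3437113),
  lower    = c(5.5363269, 0.3064817, 0.5568593, 2.7971838, 1.2809207),
  upper    = c(6.1591917, 0.5882604, 0.6342355, 2.9674615, 1.4090669)
)

# Classify each TLD by where its interval sits relative to 1
or_tab$direction <- ifelse(or_tab$lower > 1, "higher odds than .COM",
                    ifelse(or_tab$upper < 1, "lower odds than .COM",
                           "no clear difference"))
print(or_tab[, c("tld", "estimate", "direction")])
```

Read this way, a .EDU search result has roughly 5.8 times the odds of redirecting that a .COM result has, while .GOV and .NET results have lower odds.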
Regression and survival analysis
Linear regression
Suppose the values of a numerical variable Y depend on the values of another variable X.
$Y = c_0 + c_1 X + \varepsilon$
If that dependence is linear then we can use linear regression to estimate the best-fit values of the constants $c_0$ and $c_1$ that minimize the squared error over all the values $y_i \in Y$.
For more info see “R by Example” Ch. 7.1–7.3
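A minimal sketch of the idea in R, using hypothetical data rather than the course dataset: the values of `x`, `eps` and the true constants 2 and 3 are made up purely to show that `lm()` recovers them approximately.

```r
# Hypothetical data: y depends linearly on x, plus a small alternating error
x   <- 1:10
eps <- rep(c(0.5, -0.5), times = 5)
y   <- 2 + 3 * x + eps          # true c0 = 2, true c1 = 3

# Least-squares fit of Y = c0 + c1 X + error
fit <- lm(y ~ x)
c0_hat <- unname(coef(fit)["(Intercept)"])
c1_hat <- unname(coef(fit)["x"])
print(c(c0 = c0_hat, c1 = c1_hat))   # close to 2 and 3
```

The estimates differ slightly from the true constants because the error term is not zero; `summary(fit)` would report how much uncertainty that leaves.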
Why?
Dataset for linear regression example
Suppose you hypothesize that the popularity of a CMS platform influences the number of exploits made available
We can use linear regression to test for such a relationship
generatorType     CMSmarketShare   numExploits
blogger           3.5              10
concrete5         0.1              1
contao            0.2              1
datalife engine   1.5              3
discuz            1.3              8
drupal            7.2              12
Code: http://lyle.smu.edu/~tylerm/courses/econsec/code/exregress.R
Data: http://lyle.smu.edu/~tylerm/courses/econsec/data/eims.csv
Scatter plot
[Figure: scatter plot of marExp$numExploits (y-axis) against marExp$numServers (x-axis)]
plot(y=marExp$numExploits,x=marExp$numServers)
Scatter plot (log-transformed)
[Figure: scatter plot of marExp$numExploits (y-axis) against marExp$numServers (x-axis), both axes on a log scale]
plot(y=marExp$numExploits,x=marExp$numServers,log='xy')
Linear regression
> reg <- lm(lgExploits ~ lgServers, data = marExp2)
> summary(reg)
Call:
lm(formula = lgExploits ~ lgServers, data = marExp2)
Residuals:
Min 1Q Median 3Q Max
-2.9692 -1.0655 -0.6013 0.5555 5.4554
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -9.4067 3.1924 -2.947 0.006280 **
lgServers 0.6304 0.1681 3.750 0.000784 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.091 on 29 degrees of freedom
Multiple R-squared: 0.3266, Adjusted R-squared: 0.3034
F-statistic: 14.07 on 1 and 29 DF, p-value: 0.0007842
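Because both variables were log-transformed before fitting, the slope can be read as a multiplicative effect: scaling the number of servers by a factor k scales the predicted number of exploits by k raised to the slope. The sketch below only re-uses the printed estimate 0.6304; the variable names are illustrative.

```r
# Slope estimate for lgServers, copied from the summary above
slope <- 0.6304

# In a log-log model, multiplying x by k multiplies the predicted y by k^slope,
# whatever base the logarithm uses (as long as both sides use the same base).
effect_of_doubling <- 2^slope
effect_of_tenfold  <- 10^slope
print(c(doubling = effect_of_doubling, tenfold = effect_of_tenfold))
```

Under this model a CMS with twice as many servers is predicted to have roughly 1.5 times as many exploits. The R-squared of about 0.33 is a reminder that server count explains only part of the variation.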
Best-fit linear regression
[Figure: scatter plot of lg(# exploits available per CMS) against lg(# Servers per CMS), each point labeled with its CMS name, with the best-fit regression line drawn through the points]
CMS platforms shown: blogger, concrete5, contao, datalife engine, discuz, drupal, episerver, ez publish, joomla, mediawiki, movable type, mybb, php link directory, phpnuke, plone, prestashop, silverstripe, spip, typepad, typo3, vbulletin, vivvo, wordpress, xoops, cms made simple, dotnetnuke, ip.board, sharepoint, supesite, ucoz, umbraco
# scatter plot of the log-transformed data
plot(y = marExp2$lgExploits, x = marExp2$lgServers,
     xlab = "lg(# Servers per CMS)",
     ylab = "lg(# exploits available per CMS)")
# label each point with its CMS name, slightly below the point
text(x = marExp2$lgServers, y = marExp2$lgExploits - 0.3,
     labels = marExp2$generatorType)
# add the fitted regression line
abline(reg$coef)
Logistic regression
Suppose we wanted to examine how a numerical variable (e.g., position in search results) affects a binary response variable (e.g., whether the URL redirects or not)
We can’t use the odds ratios from case-control studies because that requires a categorical variable
Suppose that we’d also like to examine how both position in search results and TLD affect whether a URL redirects
For these cases, we need a logistic regression
$\log \dfrac{p}{1 - p} = c_0 + c_1 x_1 + c_2 x_2 + \varepsilon$
So for the example above considering position and TLD:
$\log \dfrac{p_{\text{redir}}}{1 - p_{\text{redir}}} = c_0 + c_1 \, \text{Position}_1 + c_2 \, \text{TLD}_2 + \varepsilon$
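The left-hand side of the logistic regression equation is the log-odds (logit) of the probability. A small sketch of that transformation and its inverse follows; `logit()` and `inv_logit()` are illustrative helper names, while `qlogis()` and `plogis()` are the base R equivalents.

```r
# Log-odds (logit) of a probability, and its inverse
logit     <- function(p) log(p / (1 - p))
inv_logit <- function(z) 1 / (1 + exp(-z))

p <- c(0.1, 0.5, 0.9)
z <- logit(p)
print(z)              # negative below 0.5, zero at 0.5, positive above
print(inv_logit(z))   # recovers the original probabilities

# Base R offers the same pair as qlogis() and plogis()
stopifnot(isTRUE(all.equal(z, qlogis(p))))
```

Working on the log-odds scale lets the right-hand side take any real value while the implied probability stays between 0 and 1.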
Logistic regression in action
Code: http://lyle.smu.edu/~tylerm/courses/econsec/code/pharmaLogit.R
> pr.logit <- glm(redirects ~ tld, data=pr, family=binomial(link = "logit"))
> summary(pr.logit)
Call:
glm(formula = redirects ~ tld, family = binomial(link = "logit"),
data = pr)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.1476 -0.5442 -0.5442 -0.5442 2.3438
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.835165 0.008626 -212.75 < 0.0000000000000002 ***
tld.EDU 1.764595 0.027159 64.97 < 0.0000000000000002 ***
tld.GOV -0.845142 0.165381 -5.11 0.000000322 ***
tld.NET -0.519996 0.033165 -15.68 < 0.0000000000000002 ***
tld.ORG 1.058195 0.015079 70.18 < 0.0000000000000002 ***
tldother 0.295390 0.024323 12.14 < 0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 165287 on 175794 degrees of freedom
Residual deviance: 156797 on 175789 degrees of freedom
AIC: 156809
Number of Fisher Scoring iterations: 4
> NagelkerkeR2(pr.logit)
$N
[1] 175795
$R2
[1] 0.07736148
Obtaining the odds ratios
Recall the logistic regression equation
$\log \dfrac{p}{1 - p} = c_0 + c_1 x_1 + c_2 x_2 + \varepsilon$
Exponentiate coefficients to get interpretable odds ratios
> coef(pr.logit)
(Intercept) tld.EDU tld.GOV tld.NET tld.ORG tldother
-1.8351654 1.7645946 -0.8451420 -0.5199959 1.0581945 0.2953898
> #get odds ratios for the coefficients plus 95% CI
> exp(cbind(OR = coef(pr.logit), confint(pr.logit)))
Waiting for profiling to be done...
OR 2.5 % 97.5 %
(Intercept) 0.1595871 0.1569062 0.1623025
tld.EDU 5.8392049 5.5364431 6.1584001
tld.GOV 0.4294964 0.3053796 0.5858515
tld.NET 0.5945230 0.5568118 0.6341472
tld.ORG 2.8811645 2.7972246 2.9675454
tldother 1.3436501 1.2808599 1.4090019
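As a hedged illustration of what the exponentiation does, the sketch below re-enters the printed coefficients by hand (it does not refit the model) and also converts the intercept into a fitted probability. The names `b`, `or`, `p_com` and `p_edu` are illustrative.

```r
# Coefficients (log-odds scale) copied from coef(pr.logit) above
b <- c("(Intercept)" = -1.8351654, "tld.EDU" = 1.7645946,
       "tld.GOV" = -0.8451420, "tld.NET" = -0.5199959,
       "tld.ORG" = 1.0581945, "tldother" = 0.2953898)

# Exponentiating gives the OR column of the table
or <- exp(b)
print(round(or, 4))

# The intercept is the log-odds for the reference TLD (.COM), so
# plogis() turns it into the fitted redirect probability for .COM results
p_com <- plogis(b[["(Intercept)"]])
# Adding the .EDU coefficient before converting gives the .EDU probability
p_edu <- plogis(b[["(Intercept)"]] + b[["tld.EDU"]])
print(c(com = p_com, edu = p_edu))
```

The odds ratios agree closely with the epitools estimates from the case-control analysis, which is expected when TLD is the only explanatory variable.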
Logistic regression #2: TLD and search result position
> pr.logit2 <- glm(redirects ~ tld + resultPosition, data=pr, family=binomial(link = "logit"))
> summary(pr.logit2)
Call:
glm(formula = redirects ~ tld + resultPosition, family = binomial(link = "logit"),
data = pr)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.2680 -0.5968 -0.5355 -0.4757 2.4268
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.14012 0.01497 -142.920 < 0.0000000000000002 ***
tld.EDU 1.77355 0.02726 65.072 < 0.0000000000000002 ***
tld.GOV -0.84060 0.16587 -5.068 0.000000402 ***
tld.NET -0.53121 0.03321 -15.993 < 0.0000000000000002 ***
tld.ORG 1.05185 0.01512 69.587 < 0.0000000000000002 ***
tldother 0.30033 0.02437 12.322 < 0.0000000000000002 ***
resultPosition 0.01803 0.00070 25.762 < 0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 165287 on 175794 degrees of freedom
Residual deviance: 156129 on 175788 degrees of freedom
AIC: 156143
Number of Fisher Scoring iterations: 5
Logistic regression #2: TLD and search result position
> exp(cbind(OR = coef(pr.logit2), confint(pr.logit2)))
Waiting for profiling to be done...
OR 2.5 % 97.5 %
(Intercept) 0.1176407 0.1142316 0.1211375
tld.EDU 5.8917404 5.5852012 6.2149893
tld.GOV 0.4314497 0.3067092 0.5886711
tld.NET 0.5878939 0.5505610 0.6271261
tld.ORG 2.8629455 2.7793345 2.9489947
tldother 1.3503082 1.2870831 1.4161226
resultPosition 1.0181977 1.0168021 1.0195962
> NagelkerkeR2(pr.logit2) #compute pseudo R^2 on logistic regression
$N
[1] 175795
$R2
[1] 0.08329341
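The odds ratio for a numerical variable applies per unit step, so the effect over several positions compounds multiplicatively. A small sketch using the printed resultPosition coefficient follows; it assumes, as the model itself does, that the effect is linear on the log-odds scale, and the variable names are illustrative.

```r
# resultPosition coefficient (log-odds per one-position step),
# copied from summary(pr.logit2)
b_pos <- 0.01803

or_one_step  <- exp(b_pos)        # matches the OR column: about 1.018
or_ten_steps <- exp(10 * b_pos)   # ten positions further down the results
print(c(one = or_one_step, ten = or_ten_steps))
```

So each additional position is associated with about 1.8% higher odds of redirecting, and a result ten positions lower with about 20% higher odds, holding TLD fixed.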
Logistic regression #3: TLD, position, search engine
> pr.logit3 <- glm(redirects ~ tld + resultPosition + searchEngine, data=pr, family=binomial(link = "logit"))
> summary(pr.logit3)
Call:
glm(formula = redirects ~ tld + resultPosition + searchEngine,
family = binomial(link = "logit"), data = pr)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.3270 -0.6539 -0.4812 -0.3956 2.5988
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.5813149 0.0172986 -149.221 < 0.0000000000000002 ***
tld.EDU 1.5001887 0.0277776 54.007 < 0.0000000000000002 ***
tld.GOV -0.8537354 0.1666852 -5.122 0.000000303 ***
tld.NET -0.4290936 0.0335099 -12.805 < 0.0000000000000002 ***
tld.ORG 0.9098682 0.0154358 58.945 < 0.0000000000000002 ***
tldother 0.3191095 0.0246746 12.933 < 0.0000000000000002 ***
resultPosition 0.0185985 0.0007081 26.265 < 0.0000000000000002 ***
searchEnginegoogle 0.8310798 0.0137375 60.497 < 0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 165287 on 175794 degrees of freedom
Residual deviance: 152322 on 175787 degrees of freedom
AIC: 152338
Number of Fisher Scoring iterations: 5
Logistic regression #3: TLD, position, search engine
> exp(cbind(OR = coef(pr.logit3), confint(pr.logit3)))
Waiting for profiling to be done...
OR 2.5 % 97.5 %
(Intercept) 0.07567444 0.0731465 0.07827858
tld.EDU 4.48253465 4.2449618 4.73330372
tld.GOV 0.42582135 0.3022669 0.58201442
tld.NET 0.65109897 0.6094052 0.69495871
tld.ORG 2.48399513 2.4099342 2.56025578
tldother 1.37590197 1.3107099 1.44382462
resultPosition 1.01877252 1.0173601 1.02018796
searchEnginegoogle 2.29579645 2.2348606 2.35850810
> NagelkerkeR2(pr.logit3) #compute pseudo R^2 on logistic regression
$N
[1] 175795
$R2
[1] 0.1166546
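The three fits can be compared using the AIC values printed in their summaries. For a logistic regression on individual 0/1 outcomes, the AIC equals the residual deviance plus twice the number of estimated coefficients, which the sketch below checks against the reported numbers. The data frame `models` is illustrative and simply re-enters those printed values.

```r
# Residual deviance, number of estimated coefficients, and AIC as printed
# in the three summaries (tld; + resultPosition; + searchEngine)
models <- data.frame(
  model    = c("pr.logit", "pr.logit2", "pr.logit3"),
  deviance = c(156797, 156129, 152322),
  n_coef   = c(6, 7, 8),
  aic      = c(156809, 156143, 152338)
)

# For these 0/1 responses the reported AIC equals deviance + 2 * n_coef
models$aic_check <- models$deviance + 2 * models$n_coef
print(models)

# Lower AIC is preferred
best <- models$model[which.min(models$aic)]
print(best)
```

The model with TLD, position and search engine has the lowest AIC, in line with its higher Nagelkerke pseudo R-squared (about 0.117 versus 0.083 and 0.077).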
Survival analysis
[Figure: timeline of three infections. Two are reported and later removed within the measurement period. The third is reported but remains at the end of measurement, so its removal time is unknown (censored).]
Censored data happens a lot
Real-world situations
  Life-expectancy
  Criminal recidivism rates
Cybercrime applications
  Measuring time to remove X (where X=malware, phishing, scam website, . . . )
  Measuring time to compromise
  Measuring time to re-infection
Best resource I found on survival analysis in R: http://socserv.mcmaster.ca/jfox/Courses/soc761/survival-analysis.pdf
Survival analysis (package survival in R)
Key challenge: estimating probability of survival when some data points survive at the end of the measurement
Solution: use the Kaplan-Meier estimator to compute probabilities that account for samples still alive (survfit in R)
Common question: are survival functions split over categorical variables statistically different?
  Use the log-rank test (survdiff in R)
  Analogous to χ2 test
Cox proportional hazards model (coxph in R) is a more sophisticated way to see how multiple variables affect the hazard rate
Hazard function h(t): instantaneous rate of failure at time t, given survival up to time t
Pharmacy redirection duration by TLD
[Figure: Survival function for search results (TLD). x-axis: t days source infection remains in search results; y-axis: S(t). Curves: all (with 95% CI), .COM, .ORG, .EDU, .NET, other]
Pharmacy redirection duration by PageRank
[Figure: Survival function for search results (PageRank). x-axis: t days source infection remains in search results; y-axis: S(t). Curves: all (with 95% CI), PR>=7, 0<PR<7, PR=0]
Statistics disentangle effect of TLD, PageRank on duration
Cox proportional hazards model
$h(t) = \exp(\alpha + \text{PageRank} \, x_1 + \text{TLD} \, x_2)$
             coef.    exp(coef.)   Std. Err.   Significance
PageRank     -0.079   0.92         0.0094      p < 0.001
.edu         -0.26    0.77         0.084       p < 0.001
.net         0.10     1.1          0.081
.org         0.055    1.1          0.052
other TLDs   0.34     1.4          0.053       p < 0.001
log-rank test: Q=159.6, p < 0.001
Phishing website recompromise
Full paper: http://lyle.smu.edu/~tylerm/cs81.pdf
What constitutes recompromise?
  If one attacker loads two phishing websites on the same server a few hours apart, we classify it as one compromise
  If the phishing pages are placed into different directories, it is more likely two distinct compromises
For simplicity, we define website recompromise as distinct attacks on the same host occurring ≥ 7 days apart
83% of phishing websites with recompromises ≥ 7 days apart are placed in different directories on the server
The Webalizer
Web page usage statistics are sometimes set up by default in a world-readable state
We automatically checked all sites reported to our feeds for the Webalizer package, revealing over 2 486 sites from June 2007–March 2008
  1 320 (53%) recorded search terms obtained from the ‘Referrer’ header in the HTTP request
Using these logs, we can determine whether a host used for phishing had been discovered using targeted search
Types of evil search
Vulnerability searches: phpizabi v0.848b c1 hfp1 (unrestricted file upload vuln.), inurl: com juser (arbitrary PHP execution vuln.)
Compromise searches: allintitle: welcome paypal
Shell searches: intitle: "index of" r57.php, c99shell drwxrwx
Search type            Websites   Phrases   Visits
Any evil search        204        456       1 207
Vulnerability search   126        206       582
Compromise search      56         99        265
Shell search           47         151       360
One phishing website compromised using evil search
1: 2007-11-30 10:31:33 phishing URL reported: http://chat2me247.com/stat/q-mono/pro/www.lloydstsb.co.uk/lloyds_tsb/logon.ibc.html
2: 2007-11-30 no evil search term 0 hits
3: 2007-12-01 no evil search term 0 hits
4: 2007-12-02 phpizabi v0.415b r3 1 hit
5: 2007-12-03 phpizabi v0.415b r3 1 hit
6: 2007-12-04 21:14:06 phishing URL reported: http://chat2me247.com/seasalter/www.usbank.com/online_banking/index.html
7: 2007-12-04 phpizabi v0.415b r3 1 hit
Let’s work with the data
R code: http://lyle.smu.edu/~tylerm/courses/econsec/code/surviveEvil2.R
Data format:
TLD   1st Compromise   2nd Compromise   # days   Censored   Evil searches?
com   2008-01-28       2008-03-31       63       0          TRUE
com   2007-11-23       2008-03-31       129      0          TRUE
IP    2008-01-16       2008-03-31       75       0          TRUE
com   2008-01-16       2008-03-31       75       0          TRUE
com   2007-10-28       2007-11-06       8        1          TRUE
com   2008-01-20       2008-03-31       71       0          TRUE
jp    2007-11-12       2008-03-31       140      0          TRUE
nu    2008-01-31       2008-03-31       60       0          TRUE
net   2007-12-27       2008-03-31       95       0          TRUE
com   2008-02-08       2008-03-31       52       0          TRUE
IP    2007-12-07       2008-01-07       31       1          TRUE
IP    2008-01-29       2008-03-31       62       0          TRUE
com   2007-10-22       2007-11-14       22       1          TRUE
com   2008-01-22       2008-03-31       69       0          TRUE
Step 1: Create a survival object
#Remember the definition of censored
# 0 = has not been recompromised
# 1 = has been recompromised
> head(webzlt)
  dom  startdate    enddate  lt censored hasevil tld
1 com 2008-01-28 2008-03-31  63        0    TRUE com
2 com 2007-11-23 2008-03-31 129        0    TRUE com
3  IP 2008-01-16 2008-03-31  75        0    TRUE  IP
4 com 2008-01-16 2008-03-31  75        0    TRUE com
5 com 2007-10-28 2007-11-06   8        1    TRUE com
6 com 2008-01-20 2008-03-31  71        0    TRUE com
> S.all<-Surv(time=webzlt$lt,event=webzlt$censor,type='right')
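To see what a survival object looks like, the following minimal sketch builds one from five of the (days, censored) pairs shown in the data format table. The data frame name `toy` is an assumption made for illustration; only `Surv` from the `survival` package is used.

```r
library(survival)

# Five rows copied from the data format table (lt = # days, censored = event flag)
toy <- data.frame(
  lt       = c(63, 129, 8, 31, 22),
  censored = c(0, 0, 1, 1, 1)   # 1 = recompromised, 0 = not recompromised
)

# Pair each duration with its event indicator
S.toy <- Surv(time = toy$lt, event = toy$censored, type = "right")

# Observations without an event are printed with a trailing "+"
print(S.toy)
```

Despite its name, the `censored` column is passed as the event indicator: a 1 marks an observed recompromise, while a 0 marks a site whose observation period ended first.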
Working with survival objects
1 Empirically estimate survival probability overall
Supply survfit with a constant right-hand side in the formula
E.g.:
surv.all<-survfit(S.all~1)
2 Empirically estimate survival probability across the levels of a single categorical variable
Supply survfit with a categorical variable on the right-hand side of the formula
E.g.:
survfit(S.all~webzlt$hasevil)
3 Regression with survival probability as response variable
Supply coxph with one or more explanatory variables on the right-hand side of the formula
E.g.:
coxph( S.all ~ webzlt$hasevil, method="breslow")
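As a rough illustration of what the empirical estimate in option 1 computes, the sketch below works out the Kaplan-Meier product by hand on five toy observations and compares it with `survfit`. The variable names are assumptions for illustration, and the five values are a small subset of the table, not the full data set.

```r
library(survival)

time  <- c(8, 22, 31, 63, 129)   # days observed
event <- c(1, 1, 1, 0, 0)        # 1 = recompromised, 0 = right-censored

# Kaplan-Meier by hand: at each event time, multiply by (1 - events / at risk)
event.times <- sort(unique(time[event == 1]))
at.risk <- sapply(event.times, function(t) sum(time >= t))
events  <- sapply(event.times, function(t) sum(time == t & event == 1))
km.manual <- cumprod(1 - events / at.risk)

# The same estimate from survfit with a constant right-hand side
fit <- survfit(Surv(time, event) ~ 1)
km.survfit <- summary(fit, times = event.times)$surv

print(cbind(event.times, km.manual, km.survfit))
```

The censored observations never lower the curve themselves; they only shrink the number at risk for later event times.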
#1: Empirically estimate survival probability overall
[Figure: Survival function for phishing websites. x-axis: t days before recompromise (0 to 150); y-axis: S(t): probability website has not been recompromised within t days (0.4 to 1.0).]
S.all<-Surv(time=webzlt$lt,event=webzlt$censor,type='right')
surv.all<-survfit(S.all~1)
plot(surv.all,xlab='t days before recompromise',
ylab='S(t): probability website has not been recompromised within t days',
ylim=c(0.4,1), main='Survival function for phishing websites',lwd=1.5)
#2: Emp. estimate survival prob. for 1 cat. var.
[Figure: Survival function for phishing websites, one curve per group. x-axis: t days before recompromise (0 to 150); y-axis: S(t) (0.4 to 1.0); legend: has evil terms, no evil terms.]
S.all<-Surv(time=webzlt$lt,event=webzlt$censor,type='right')
surv.evil<-survfit(S.all~webzlt$hasevil)
plot(surv.evil,xlab='t days before recompromise',
ylab='S(t)',ylim=c(0.4,1), lwd=1.5,col=c('blue','red'),
main='Survival function for phishing websites')
legend("topright",legend=c("has evil terms","no evil terms"),
col=c("red","blue"),lty=1)
Is the difference between survival probabilities across categories statistically significant?
> survdiff(S.all~webzlt$hasevil)
Call:
survdiff(formula = S.all ~ webzlt$hasevil)
N Observed Expected (O-E)^2/E (O-E)^2/V
webzlt$hasevil=FALSE 746 140 156.7 1.79 13.4
webzlt$hasevil=TRUE 121 41 24.3 11.55 13.4
Chisq= 13.4 on 1 degrees of freedom, p= 0.000249
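To make the test output easier to read, here is a small sketch that recomputes two things from the printed numbers: how far the observed recompromise counts sit from their expected values, and the p-value implied by the chi-square statistic. The variable names are assumptions, and because the inputs are the rounded figures printed above, the results only approximate what survdiff reports.

```r
# Observed and expected recompromise counts, copied from the survdiff output
observed <- c(no.evil = 140, has.evil = 41)
expected <- c(no.evil = 156.7, has.evil = 24.3)

# Positive excess means more recompromises than the null hypothesis predicts
excess <- observed - expected
print(excess)

# p-value for the reported chi-square statistic on 1 degree of freedom
p.value <- pchisq(13.4, df = 1, lower.tail = FALSE)
print(p.value)
```

Sites found through evil searches were recompromised more often than expected under the hypothesis of no difference, which is what drives the small p-value.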
#3: Regression with survival prob. as response variable
S.all<-Surv(time=webzlt$lt,event=webzlt$censor,type='right')
evil.ph <- coxph( S.all ~ webzlt$hasevil, method="breslow")
summary(evil.ph)
> summary(evil.ph)
Call:
coxph(formula = Surv(webzlt$lt, webzlt$censor) ~ webzlt$hasevil,
    method = "breslow")
  n= 867, number of events= 181
                     coef exp(coef) se(coef)     z Pr(>|z|)
webzlt$hasevilTRUE 0.6393    1.8951   0.1778 3.595 0.000325 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
                   exp(coef) exp(-coef) lower .95 upper .95
webzlt$hasevilTRUE     1.895     0.5277     1.337     2.685
Concordance= 0.539  (se = 0.013 )
Rsquare= 0.013   (max possible= 0.932 )
Likelihood ratio test= 11.43  on 1 df,   p=0.0007219
Wald test            = 12.92  on 1 df,   p=0.0003246
Score (logrank) test = 13.37  on 1 df,   p=0.000256
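The columns of the summary are linked by simple arithmetic. As a sketch (variable names are assumptions, inputs are the rounded `coef` and `se(coef)` printed above), the hazard ratio, its 95% confidence interval, and the Wald p-value can be approximately reproduced as follows.

```r
# Values copied from the summary(evil.ph) output
coef    <- 0.6393
se.coef <- 0.1778

hazard.ratio <- exp(coef)                       # the exp(coef) column
z.crit <- qnorm(0.975)                          # about 1.96 for a 95% interval
ci <- exp(coef + c(-1, 1) * z.crit * se.coef)   # lower .95 and upper .95
z  <- coef / se.coef                            # Wald z statistic
p  <- 2 * pnorm(-abs(z))                        # two-sided p-value

print(round(c(HR = hazard.ratio, lower = ci[1], upper = ci[2], z = z), 3))
print(p)
```

Read this way, a website found through evil search terms has roughly 1.9 times the hazard of recompromise at any given time, with a 95% interval that excludes 1.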
One more survival example: Bitcoin currency exchanges
Bitcoin is a digital crypto-currency
Decentralization is a key feature of Bitcoin’s design
Yet an extensive ecosystem of 3rd-party intermediaries now supports Bitcoin transactions: currency exchanges, escrow services, online wallets, mining pools, investment services, . . .
Most risk Bitcoin holders face stems from interacting with these intermediaries, who act as de facto central authorities
We focus on risk posed by failures of currency exchanges
R code: http://lyle.smu.edu/~tylerm/data/bitcoin/bitcoinExScript.R
Data collection methodology
Data sources
1 Daily transaction volume data on 40 exchanges converting into 33 currencies from bitcoincharts.com
2 Checked for closure, mention of security breaches and whether investors were repaid on Bitcoin Wiki and forums
3 To assess impact of pressure from financial regulators, we identified each exchange’s country of incorporation and used a World Bank index on compliance with anti-money laundering regulations
Key measure: exchange lifetime
Time difference between first and last observed trade
We deem an exchange closed if its last observed transaction occurred at least 2 weeks before data collection finished
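The lifetime and closure rules can be sketched in a few lines of R. Everything in this example is hypothetical: the exchange names, the trade dates, the column names, and the collection end date are made up to illustrate the rule and are not taken from the study data.

```r
# Hypothetical first and last observed trade per exchange
trades <- data.frame(
  exchange    = c("ExchangeA", "ExchangeB"),
  first.trade = as.Date(c("2011-04-01", "2011-06-15")),
  last.trade  = as.Date(c("2011-08-10", "2013-01-14"))
)

# Hypothetical date on which data collection finished
collection.end <- as.Date("2013-01-16")

# Lifetime: days between first and last observed trade
trades$lifetime <- as.numeric(trades$last.trade - trades$first.trade)

# Closed: no trade seen for at least 14 days before collection ended
trades$closed <- as.numeric(collection.end - trades$last.trade) >= 14

print(trades)
```

Exchanges that are still trading when collection ends are the right-censored observations: their true lifetime is at least the value computed here.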
Some initial summary statistics
40 Bitcoin currency exchanges opened since 2010
18 have subsequently closed (45% failure rate)
Median lifetime is 381 days
45% of closed exchanges did not reimburse customers
9 exchanges were breached (5 closed)
18 closed Bitcoin currency exchanges
Exchange                 Origin  Dates Active   Daily vol.  Closed?  Breached?  Repaid?  AML
BitcoinMarket            US      4/10 – 6/11          2454  yes      yes        –        34.3
Bitomat                  PL      4/11 – 8/11           758  yes      yes        yes      21.7
FreshBTC                 PL      8/11 – 9/11             3  yes      no         –        21.7
Bitcoin7                 US/BG   6/11 – 10/11          528  yes      yes        no       33.3
ExchangeBitCoins.com     US      6/11 – 10/11          551  yes      no         –        34.3
Bitchange.pl             PL      8/11 – 10/11          380  yes      no         –        21.7
Brasil Bitcoin Market    BR      9/11 – 11/11            0  yes      no         –        24.3
Aqoin                    ES      9/11 – 11/11           11  yes      no         –        30.7
Global Bitcoin Exchange  ?       9/11 – 1/12            14  yes      no         –        27.9
Bitcoin2Cash             US      4/11 – 1/12            18  yes      no         –        34.3
TradeHill                US      6/11 – 2/12          5082  yes      yes        yes      34.3
World Bitcoin Exchange   AU      8/11 – 2/12           220  yes      yes        no       25.7
Ruxum                    US      6/11 – 4/12            37  yes      no         yes      34.3
btctree                  US/CN   5/12 – 7/12            75  yes      no         yes      29.2
btcex.com                RU      9/10 – 7/12           528  yes      no         no       27.7
IMCEX.com                SC      7/11 – 10/12            2  yes      no         –        11.9
Crypto X Change          AU      11/11 – 11/12         874  yes      no         –        25.7
Bitmarket.eu             PL      4/11 – 12/12           33  yes      no         no       21.7
22 open Bitcoin currency exchanges
Exchange                   Origin  Dates Active    Daily vol.  Closed?  Breached?  Repaid?  AML
bitNZ                      NZ      9/11 – pres.            27  no       no         –        21.3
ICBIT Stock Exchange       SE      3/12 – pres.             3  no       no         –        27.0
WeExchange                 US/AU   10/11 – pres.            2  no       no         –        30.0
Vircurex                   US?     12/11 – pres.            6  no       yes        –        27.9
btc-e.com                  BG      8/11 – pres.          2604  no       yes        yes      32.3
Mercado Bitcoin            BR      7/11 – pres.            67  no       no         –        24.3
Canadian Virtual Exchange  CA      6/11 – pres.           832  no       no         –        25.0
btcchina.com               CN      6/11 – pres.           473  no       no         –        24.0
bitcoin-24.com             DE      5/12 – pres.           924  no       no         –        26.0
VirWox                     DE      4/11 – pres.          1668  no       no         –        26.0
Bitcoin.de                 DE      8/11 – pres.          1204  no       no         –        26.0
Bitcoin Central            FR      1/11 – pres.           118  no       no         –        31.7
Mt. Gox                    JP      7/10 – pres.         43230  no       yes        yes      22.7
Bitcurex                   PL      7/12 – pres.           157  no       no         –        21.7
Kapiton                    SE      4/12 – pres.           160  no       no         –        27.0
bitstamp                   SL      9/11 – pres.          1274  no       no         –        35.3
InterSango                 UK      7/11 – pres.          2741  no       no         –        35.3
Bitfloor                   US      5/12 – pres.           816  no       yes        no       34.3
Camp BX                    US      7/11 – pres.           622  no       no         –        34.3
The Rock Trading Company   US      6/11 – pres.            52  no       no         –        34.3
bitme                      US      7/12 – pres.            77  no       no         –        34.3
FYB-SG                     SG      1/13 – pres.             3  no       no         –        33.7
What factors affect whether an exchange closes?
We hypothesize three variables affect survival time for a Bitcoin exchange
1 Average daily transaction volume (positive)
2 Experiencing security breach (negative)
3 AML/CFT compliance (negative)
Since lifetimes are censored, we construct a Cox proportional hazards model:
$$h_i(t) = h_0(t)\,\exp\left(\beta_1 \log(\text{Daily vol.})_i + \beta_2\,\text{Breached}_i + \beta_3\,\text{AML}_i\right)$$
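One consequence of this model form is worth spelling out. Take two exchanges i and j that share the same daily volume and AML score but differ in breach status (i was breached, j was not). The shorthand x_1 and x_3 for the shared covariate values is introduced here only for illustration. The baseline hazard cancels in the ratio:

```latex
\[
\frac{h_i(t)}{h_j(t)}
  = \frac{h_0(t)\,\exp(\beta_1 x_1 + \beta_2 \cdot 1 + \beta_3 x_3)}
         {h_0(t)\,\exp(\beta_1 x_1 + \beta_2 \cdot 0 + \beta_3 x_3)}
  = \exp(\beta_2)
\]
```

The ratio does not depend on t, which is the proportional hazards assumption, and it is why exp(coef) can be read as a hazard ratio in the R output.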
R code: Cox proportional hazards model
cox.vh<-coxph(Surv(time=amlsv$lifetime,event=amlsv$censored,type='right')~
log2(amlsv$dailyvol)+amlsv$Hacked+amlsv$All,
method="breslow")
> cox.vh
Call:
coxph(formula = Surv(time = amlsv$lifetime, event = amlsv$censored,
type = "right") ~ log2(amlsv$dailyvol) + amlsv$Hacked + amlsv$All,
method = "breslow")
coef exp(coef) se(coef) z p
log2(amlsv$dailyvol) -0.17396 0.84 0.0719 -2.4185 0.016
amlsv$HackedTRUE 0.85685 2.36 0.5715 1.4992 0.130
amlsv$All 0.00411 1.00 0.0421 0.0978 0.920
Likelihood ratio test=6.28 on 3 df, p=0.0988 n= 40, number of events= 18
Cox proportional hazards model: results
                       coef.   exp(coef.)  Std. Err.  Significance
log(Daily vol.)_i  β1  -0.173  0.840       0.072      p = 0.0156
Breached_i         β2   0.857  2.36        0.572      p = 0.1338
AML_i              β3   0.004  1.004       0.042      p = 0.9221
log-rank test: Q=7.01 (p = 0.0715), R^2 = 0.145
Higher daily transaction volumes associated with longer survival times (statistically significant)
Experiencing a breach associated with shorter survival times (not quite statistically significant)
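To give the volume coefficient a concrete reading, the sketch below assumes the log2 transform used in the R call, so each doubling of daily volume multiplies the hazard by exp(coef). It then compares the highest-volume exchange in the tables (Mt. Gox) with a low-volume one (Bitcoin2Cash). The variable names are assumptions, and the volumes are the rounded table values, so the result is approximate.

```r
# Coefficient on log2(daily volume), copied from the cox.vh output
beta.vol <- -0.17396

# Daily volumes taken from the exchange tables (rounded there)
vol.mtgox        <- 43230
vol.bitcoin2cash <- 18

# Each doubling of volume multiplies the hazard by exp(beta.vol)
hr.per.doubling <- exp(beta.vol)

# Hazard of Mt. Gox relative to Bitcoin2Cash, other covariates held equal
doublings <- log2(vol.mtgox / vol.bitcoin2cash)
hr.mtgox.vs.b2c <- exp(beta.vol * doublings)

print(c(per.doubling = hr.per.doubling, mtgox.vs.b2c = hr.mtgox.vs.b2c))
```

This comparison deliberately ignores the breach and AML terms, so it isolates the volume effect only.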
Survival probability for Bitcoin exchanges
[Figure: x-axis: Days (0 to 800); y-axis: Survival probability (0.0 to 1.0); legend: Average.]
R code: Survival probability for Bitcoin exchanges
par(mar=c(4.1,4.1,0.5,0.5))
plot(survfit(cox.vh),col="black",lty="solid",lwd=2,
xlab="Days",
ylab="Survival probability",
cex.lab=1.3,
cex.axis=1.3
)
legend("topright",legend=c("Average"),col=c("black"),lwd=2,lty=c("solid"))
Reminder: data frame structure
> cox.vh
Call:
coxph(formula = Surv(time = amlsv$lifetime, event = amlsv$censored,
type = "right") ~ log2(amlsv$dailyvol) + amlsv$Hacked + amlsv$All,
method = "breslow")
coef exp(coef) se(coef) z p
log2(amlsv$dailyvol) -0.17396 0.84 0.0719 -2.4185 0.016
amlsv$HackedTRUE 0.85685 2.36 0.5715 1.4992 0.130
amlsv$All 0.00411 1.00 0.0421 0.0978 0.920
Likelihood ratio test=6.28 on 3 df, p=0.0988 n= 40, number of events= 18
> head(amlsv[,c('dailyvol','Hacked','All')],10)
dailyvol Hacked All
Global Bitcoin Exchnage 13.7413402 FALSE 27.866
Vircurex 5.6135567 TRUE 27.866
Crypto X Change 874.2331200 FALSE 25.670
World Bitcoin Exchange 220.0284211 TRUE 25.670
btc-e.com 2603.7702724 TRUE 32.330
Mercado Bitcoin 67.0104275 FALSE 24.330
Brasil Bitcoin Market 0.1896721 FALSE 24.330
Canadian Virtual Exchange 832.3611224 FALSE 25.000
btcchina.com 472.6303602 FALSE 24.000
bitcoin-24.com 923.6339683 FALSE 26.000
High-volume exchanges have better chance to survive
[Figure: x-axis: Days (0 to 800); y-axis: Survival probability (0.0 to 1.0); legend: Mt. Gox, Intersango, Average.]
R code: High-volume exchanges have better chance to survive
coxplots<-survfit(cox.vh,newdata=amlsv)
par(mar=c(4.1,4.1,0.5,0.5))
plot(coxplots[15],col="green",lty="dashed",lwd=2,
xlab="Days",
ylab="Survival probability",
cex.lab=1.3,
cex.axis=1.3
) #Mt Gox
lines(coxplots[28],col="blue",lty="dotdash",lwd=2) #Intersango
lines(survfit(cox.vh),lwd=2) #Mean
legend("topright",legend=c("Mt. Gox","Intersango","Average"),
col=c("green","blue","black"),lwd=2,
lty=c("dashed","dotdash","solid"))
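The positions 15 and 28 above depend on the row order of amlsv. Since the data frame uses exchange names as row names, one possible alternative is to look the curve up by name. The sketch below shows the idea on stand-in data so that it runs on its own: the `lung` data set shipped with the survival package replaces amlsv, and the `profiles` data frame with its row names is hypothetical.

```r
library(survival)

# Stand-in model: survival::lung is used only so the sketch is self-contained
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

# Hypothetical covariate profiles, labelled by row name like the exchanges in amlsv
profiles <- data.frame(age = c(50, 70), sex = c(1, 1),
                       row.names = c("younger", "older"))

# One predicted survival curve per row of newdata, stored as columns of $surv
curves <- survfit(fit, newdata = profiles)

# Look the curve up by name instead of by a hard-coded position
idx <- which(rownames(profiles) == "older")
plot(curves$time, curves$surv[, idx], type = "s",
     xlab = "Days", ylab = "Survival probability")
```

Applied to the exchange data, the same lookup would use `which(rownames(amlsv) == "Mt. Gox")`, assuming the row names match the exchange names exactly.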
Low-volume exchanges have worse chance to survive
[Figure: x-axis: Days (0 to 800); y-axis: Survival probability (0.0 to 1.0); legend: Mt. Gox, Intersango, Bitcoin2Cash, Average.]
Yet some lower-risk exchanges collapse, high-risk survive
[Figure: x-axis: Days (0 to 800); y-axis: Survival probability (0.0 to 1.0); legend: Mt. Gox, Intersango, Bitcoin2Cash, Vircurex, ExchangeBitCoins.com, Average.]