
Can We Use Software Bug Reports to Identify Vulnerability Discovery Strategies?

Farzana Ahamed Bhuiyan
Tennessee Technological University
Cookeville, Tennessee, USA
[email protected]

Raunak Shakya
Tennessee Technological University
Cookeville, Tennessee, USA
[email protected]

Akond Rahman
Tennessee Technological University
Cookeville, Tennessee, USA
[email protected]

ABSTRACT
Daily horror stories related to software vulnerabilities necessitate an understanding of how vulnerabilities are discovered. Identification of data sources that can be leveraged to understand how vulnerabilities are discovered could aid cybersecurity researchers in characterizing the exploitation of vulnerabilities. The goal of the paper is to help cybersecurity researchers in characterizing vulnerabilities by conducting an empirical study of software bug reports. We apply qualitative analysis on 729, 908, and 5,336 open source software (OSS) bug reports, respectively collected from Gentoo, LibreOffice, and Mozilla, to investigate if bug reports include vulnerability discovery strategies, i.e., sequences of computational and/or cognitive activities that an attacker performs to discover vulnerabilities, where the vulnerability is indexed by a credible source, such as the National Vulnerability Database (NVD). We evaluate two approaches, namely a text feature-based approach and a regular expression-based approach, to automatically identify bug reports that include vulnerability discovery strategies.

We observe that the Gentoo, LibreOffice, and Mozilla bug reports include vulnerability discovery strategies. Using text feature-based prediction models, we observe the highest prediction performance for the Mozilla dataset, with a recall of 0.78. Using the regular expression-based approach, we observe a recall of 0.83 for the same dataset. Findings from our paper provide the groundwork for cybersecurity researchers to use OSS bug reports as a data source for advancing the science of vulnerabilities.

CCS CONCEPTS
• Security and privacy → Software security engineering.

KEYWORDS
bug report, empirical study, ethical hacking, strategy, vulnerability

ACM Reference Format:
Farzana Ahamed Bhuiyan, Raunak Shakya, and Akond Rahman. 2020. Can We Use Software Bug Reports to Identify Vulnerability Discovery Strategies?. In Hot Topics in the Science of Security Symposium (HotSoS '20), April 7-8, 2020, Lawrence, KS, USA. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3384217.3385618

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
HotSoS '20, April 7-8, 2020, Lawrence, KS, USA
© 2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-7561-0/20/04...$15.00
https://doi.org/10.1145/3384217.3385618

1 INTRODUCTION
According to the United States Department of Homeland Security, the information technology (IT) sector, which encompasses software and software-based services, is a critical infrastructure and "central to the nation's security, economy, and public health and safety" [29]. However, software vulnerabilities can lead to large-scale consequences. For example, in July 2019, the 'DELL PC Doctor' vulnerability impacted millions of Dell computers [13, 14]. Characterizing how vulnerabilities are discovered can be helpful to devise mechanisms that protect software and software-based services early in the development stage.

Creation of knowledge related to vulnerability discovery has gained interest in the research community. For example, King [30] emphasized understanding attacker actions to characterize the nature of vulnerabilities. Munaiah et al. [28] expressed similar views: "The ability to understand the way in which these adversaries discover and exploit vulnerabilities to infiltrate systems can help bring the attacker mindset out of the wild and into the software development process".

Identifying vulnerability discovery strategies can be useful to characterize vulnerabilities in software systems. A vulnerability discovery strategy is one or more computational and/or cognitive activities that an attacker performs to discover a vulnerability, where the vulnerability is indexed by a credible source, such as the National Vulnerability Database (NVD). However, the availability of data sources to study vulnerability discovery strategies is limited [28]. Researchers in prior work [28] have stated that even though identifying vulnerability discovery strategies is important, conducting such analysis could be challenging due to the "scarcity of such empirical information".

One option to identify vulnerability discovery strategies could be the use of bug reports. A bug report is a description of a bug that is typically indexed by a bug database. In the software engineering domain, researchers have successfully used open source software (OSS) bug reports to characterize and quantify software bugs [35]. Based on evidence reported by prior research [5, 7, 8], we conjecture that bug reports from OSS can be used to identify vulnerability discovery strategies. Prior research [5, 7, 8] has reported that OSS bug reports include the sequence of actions that was performed to discover and reproduce a certain bug in a software system. Our hypothesis is that OSS bug reports will include sequences of actions that express how a vulnerability was discovered by the attacker. For example, if an OSS bug report includes the steps needed to reproduce a vulnerability, then the bug report can be further investigated to find out how the attacker discovered the vulnerability. OSS bug reports are retrievable through OSS APIs, which researchers can use to construct datasets that can further be analyzed to identify vulnerability discovery strategies.

The goal of the paper is to help cybersecurity researchers in characterizing vulnerabilities by conducting an empirical study of software bug reports.

We answer the following research questions:
• RQ1: How frequently do vulnerability discovery strategies appear in bug reports?
• RQ2: How can we automatically identify bug reports that include vulnerability discovery strategies?

We conduct an empirical analysis with OSS bug reports collected from IT organizations that have open sourced the bug reports and source code of their software. We apply qualitative analysis on 729, 908, and 5,336 OSS bug reports, respectively collected from Gentoo, LibreOffice, and Mozilla, to investigate if those reports include vulnerability discovery strategies. We also use two automated approaches, namely a text feature-based approach and a regular expression-based approach, to identify bug reports that include vulnerability discovery strategies. For the text feature-based approach, we construct n-grams [21] from the bug report content, and construct models with statistical learners, such as classification and regression trees [3], logistic regression [16], and deep neural networks [23]. We evaluate our models using two approaches: 10×10-fold validation and cross dataset evaluation.

We make the following contributions:
• An empirical study that shows evidence of vulnerability discovery strategies present in OSS bug reports; and
• A set of prediction models that automatically identify the presence of vulnerability discovery strategies in OSS bug reports.

We organize the rest of the paper as follows: in Section 2 we provide necessary background and prior research related to our paper. We present the methodology of our paper in Section 3. We provide empirical findings in Section 4, and discuss results in Section 5. We describe the threats to validity for our paper in Section 6. Finally, we conclude the paper in Section 7.

2 BACKGROUND AND RELATED WORK
In this section we provide necessary background on bug reports and discuss relevant prior work.

2.1 Background on Bug Reports
Bug reports are software artifacts that are used by software development teams to track bugs and tasks. Development teams use both open source and proprietary tools to track their bugs. Examples of open source tools include Bugzilla^1, Fossil^2, and Trac^3. Examples of proprietary bug tracking tools include Jira^4 and Team Foundation Server^5. Such bug tracking tools are also referred to as issue trackers.

Typical entities of bug reports include but are not limited to:
• a title that summarizes the bug;
• a bug ID that is unique across all listed bugs;
• timestamps on when the bug report was created and updated;
• affected product information, e.g., operating system, product type, and platform;
• bug report category;
• bug report priority; and
• comments that discuss how to reproduce the bug.

1. https://www.bugzilla.org/
2. https://www.fossil-scm.org/home/doc/trunk/www/index.wiki
3. https://trac.edgewall.org/
4. https://www.atlassian.com/software/jira
5. https://azure.microsoft.com/en-us/services/devops/server/

We provide an annotated snapshot of a bug report^6 retrieved from the Mozilla organization in Figure 1 to illustrate the entities of a bug report even further.

Figure 1: Annotation of a bug report retrieved from Mozilla, with callouts marking the bug ID, timestamp, title, product information, comment, and bug category.

6. https://bugzilla.mozilla.org/show_bug.cgi?id=519925

2.2 Related Work
Our paper is closely related to prior work that has investigated bug localization, bug report identification, and bug report quality assessment. For example, Peters et al. [32] proposed 'FARSEC', a technique to reduce mislabelling of security bug reports using text-based prediction models. Peters et al. [32] developed FARSEC for automatic identification of security cross-words and for scoring bug reports according to how likely they are to be labeled as security bug reports.

Bug reports have also been leveraged for locating bugs in software. Dilshener et al. [12] proposed and evaluated an approach that uses a heuristic-based score for a file against a given report, without requiring a historical dataset of code and bug reports. Ali et al. [1] proposed and evaluated 'LIBCROOS', which combines the results of information retrieval techniques with binary class relationships gathered through source code analyses.

Researchers have also investigated automated retrieval of information from bug reports. Chaparro et al. [7] investigated the effectiveness of query reduction strategies, based on the structure of bug descriptions available from bug reports. The authors [7] proposed an automated technique where developers issue an initial query from the full text of the bug report and inspect the top-N code candidates returned by text retrieval-based bug localization techniques. To reformulate the query, the authors [7] used 31 strategies with 5 text retrieval-based bug localization techniques. In another work, Chaparro et al. [8] proposed, implemented, and evaluated 'DEMIBUD', an automated technique to detect if the information needed to reproduce a bug is missing from bug reports. The authors [8] identified patterns that captured the discourse, and developed 3 versions of DEMIBUD, based on heuristics, natural language processing, and linear support vector machines, to detect the absence of expected software behavior (EB) and steps to reproduce (S2R) in bug reports. In another work, Chaparro et al. [5] proposed and evaluated EULER, an automated technique to identify and evaluate the quality of the steps to reproduce in a bug report and provide feedback to reporters about ambiguous steps. Chaparro et al. [5] used sequence labeling in combination with discourse patterns and dependency parsing to identify steps-to-reproduce sentences. EULER was also capable of providing specific feedback to the reporter about missing steps to reproduce a bug. Chaparro et al. [6] conducted an empirical study to investigate to what extent duplicate bug report descriptions use the same vocabulary, and the impact of vocabulary agreement on duplicate detection of bug reports. Zhao et al. [40] proposed and evaluated 'S2RMiner', an automated technique that extracts the text description of steps to reproduce from bug reports using HTML parsing, natural language processing, and machine learning techniques. The authors downloaded bug reports as HTML files and used HTML parsing to extract relevant text from the files. From the extracted content the authors used natural language processing to obtain text features for each sentence, and applied support vector machines to predict and extract steps to reproduce.

The above-mentioned discussion shows a plethora of research related to bug reports, e.g., bug information retrieval, bug localization, and bug report quality assessment. However, we notice a lack of research that investigates if bug reports contain information on how vulnerabilities are discovered. We address this research gap in our paper.

3 METHODOLOGY
In this section, we first provide the necessary definitions listed below:

• Vulnerability: A vulnerability is a weakness in computational logic existing in software and hardware components, which upon exploitation can result in a negative impact on confidentiality, integrity, or availability [11].

• Vulnerability discovery strategy: One or more computational and/or cognitive activities that an attacker performs to discover vulnerabilities, where the vulnerability is indexed by a credible source, such as the NVD.

Now, we describe our methodology to conduct the research study, which is summarized in Figure 2.

3.1 Bug Database Mining
In our paper we focus on identifying if vulnerability discovery strategies exist in software bug reports. The first step of performing such analysis is to mine bug databases and collect bug reports. We rely on OSS bug reports as they are accessible via APIs. We collect bug reports from three organizations: Gentoo Linux^7, LibreOffice^8, and Mozilla^9. We select these organizations as they differ from one another with respect to the software products that they deliver to end-users. Gentoo Linux is an OSS Linux-based operating system. LibreOffice is an OSS suite for word processing. Mozilla produces a wide range of OSS products including browsers (Mozilla Firefox and Camino), e-mail clients (Mozilla Thunderbird), a bug tracking system (Bugzilla), and rendering engines (Gecko)^10. Our assumption is that by collecting bug reports from a set of organizations that produce a wide range of software for end-users, we will increase the generalizability of our findings. For all three organizations, namely Gentoo Linux, LibreOffice, and Mozilla, we use the Bugzilla API to collect bug reports. We collected all bug reports used in our paper on July 25, 2019.

7. https://www.gentoo.org/
8. https://www.libreoffice.org/
9. https://www.mozilla.org/en-US/
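As a minimal sketch of this mining step (ours, not the authors' released tooling), the following Python snippet queries a Bugzilla instance's public REST API; the endpoint and parameters follow the public Bugzilla REST documentation and are our assumptions, since the paper does not publish its collection script.

    import requests

    # Sketch of bug database mining via the Bugzilla REST API.
    # Endpoint and parameters are assumptions based on the public
    # Bugzilla REST documentation, not the authors' released code.
    BASE_URL = "https://bugzilla.mozilla.org/rest/bug"
    params = {
        "include_fields": "id,summary,creation_time",
        "limit": 100,   # page size
        "offset": 0,    # increase to page through all bug reports
    }
    response = requests.get(BASE_URL, params=params, timeout=60)
    bugs = response.json()["bugs"]
    print(f"Fetched {len(bugs)} bug reports; first: {bugs[0]['summary']}")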

3.2 Bug Report Filtering
We include vulnerabilities that are indexed by the National Vulnerability Database (NVD). Vulnerabilities in the NVD are indexed by the keyword 'CVE'^11. Upon collection of the bug reports from the three organizations, we apply filtering criteria to identify which bug reports are related to a vulnerability. The filtering criteria are listed below:
• Step-1: We search for the keyword 'CVE' in the title, description, and comments of each collected bug report.
• Step-2: From the search results, we manually examine if a bug report is actually related to a vulnerability indexed by the NVD.
Upon completion of this step we obtain a set of bug reports that are related to a vulnerability indexed by the NVD. Each collected bug report maps to a CVE. As bug reports can be duplicated, multiple bug reports can map to the same CVE.
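As a minimal sketch (ours, not the authors' tooling), Step-1 of this filtering could be implemented as follows; the field names title, description, and comments are assumptions about the mined bug report schema.

    import re

    # Matches CVE identifiers such as CVE-2017-7812.
    CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,7}", re.IGNORECASE)

    def mentions_cve(report):
        """Step-1: flag a bug report whose title, description, or
        comments mention a CVE identifier."""
        texts = [report.get("title", ""), report.get("description", "")]
        texts.extend(report.get("comments", []))
        return any(CVE_PATTERN.search(text) for text in texts)

    reports = [
        {"title": "Crash on drag-and-drop", "description": "",
         "comments": ["Tracked as CVE-2017-7812."]},
        {"title": "Typo in menu label", "description": "", "comments": []},
    ]
    candidates = [r for r in reports if mentions_cve(r)]  # Step-2 is manual
    print(len(candidates))  # 1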

3.3 Qualitative Rating to Construct Strategy Oracle Dataset
We construct an oracle dataset to identify if a bug report contains information on what strategy was adopted to discover a vulnerability. We use raters to make this determination. In our paper, a rater is a person who is knowledgeable in software security and performs qualitative analysis. Each rater, experienced in software security and bug reports, individually examines a bug report and determines if the bug report provides the strategy by which the reported vulnerability was discovered. A rater determines a bug report to contain a strategy to discover the vulnerability if all of the following criteria are satisfied:

• Criteria-1: The bug report includes text patterns that indicate that the reporter is describing how the vulnerability was discovered. The rater manually examines if the bug report includes any of the following text patterns: 'steps', 'reproduce', and 'observed behavior';
• Criteria-2: The bug report includes a sequence of steps on how the vulnerability was discovered; and
• Criteria-3: The bug report provides output of the conducted activities in the form of attachments, such as console output and screenshots.

We report the agreement rate by calculating Cohen's Kappa [9] to record the agreement level between raters. We follow Landis and Koch [22]'s interpretations of Cohen's Kappa values.
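As a minimal sketch, Cohen's Kappa between two raters' labels can be computed with scikit-learn; the rating vectors below are illustrative, not our actual data.

    from sklearn.metrics import cohen_kappa_score

    # 1 = bug report includes a discovery strategy, 0 = it does not.
    rater_1 = [1, 0, 0, 1, 0, 1, 0, 0]  # hypothetical ratings
    rater_2 = [1, 0, 0, 0, 0, 1, 0, 0]

    kappa = cohen_kappa_score(rater_1, rater_2)
    print(f"Cohen's kappa: {kappa:.2f}")  # interpret using Table 1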

10. http://kb.mozillazine.org/Summary_of_Mozilla_products
11. https://nvd.nist.gov/vuln


Figure 2: Steps to conduct our research study: bug database mining → bug report filtering → qualitative rating → answers to RQ1 and RQ2.

Table 1: Cohen's Kappa Interpretation by Landis and Koch [22]

Cohen's Kappa  Interpretation
< 0.00         'Poor agreement'
0.00-0.20      'Slight agreement'
0.21-0.40      'Fair agreement'
0.41-0.60      'Moderate agreement'
0.61-0.80      'Substantial agreement'
0.81-1.00      'Almost perfect agreement'

The interpretation of Landis and Koch [22] for Cohen's Kappa is listed in Table 1.

The raters may disagree. We allocate another rater, who is experienced in bug reports and software security, to resolve disagreements. In the case of disagreements, the resolver's decision is final.

3.4 Answer to RQ1: How frequently do vulnerability discovery strategies appear in bug reports?
The answer to RQ1 demonstrates whether or not bug reports contain strategies that are used to discover vulnerabilities. We answer RQ1 by using the oracle dataset constructed in Section 3.3. The oracle dataset provides a mapping of which bug reports include a strategy to discover a vulnerability. From the mapping we can quantify how many bug reports include a vulnerability discovery strategy. We use two metrics to quantify frequency:

• count of bug reports that include a strategy to discover a vulnerability; and
• percentage of bug reports that include vulnerability discovery strategies, computed using Equation 1.

Proportion of Bug Reports (%) = (# of bug reports that include a vulnerability discovery strategy / total # of bug reports in the dataset) × 100   (1)

3.5 Answer to RQ2: How can we automatically identify bug reports that include vulnerability discovery strategies?
Using raters to identify bug reports that include vulnerability discovery strategies requires manual effort and may not scale when analyzing large numbers of bug reports. We hypothesize that automated techniques that leverage text processing and machine learning can help in identifying bug reports with vulnerability discovery strategies. Similar to prior work [8] that has investigated reproducibility of bugs from bug report content, we implement the two approaches described below.

3.5.1 Regular Expression-based Approach. We hypothesize that bug report descriptions or comments will include certain text patterns that can be automatically analyzed to identify the presence of vulnerability discovery strategies. We use Figure 3 to illustrate our hypothesis further. In Figure 3 we observe a bug report [4] comment that describes the steps needed to reproduce a vulnerability. The steps to reproduce follow a certain pattern: first, the 'Steps To Reproduce' keyword is used to express that the following steps were used to reproduce the vulnerability. Next, the reporter describes how the vulnerability is discovered using an itemized list of actions ('1., 2., 3., 4.'). From the example presented in Figure 3, we observe that text patterns that are reflective of vulnerability discovery strategies exist in bug reports, and we can rely on regular expressions to capture these patterns.

We rely on regular expressions provided by Chaparro et al. [8] to derive the regular expressions needed to extract the text patterns related to vulnerability discovery strategies. We rely on Chaparro et al. [8]'s regular expressions as they were systematically derived from bug reports that express steps to reproduce a bug. We present the regular expression that we use to automatically detect vulnerability discovery strategies in Figure 4. Our assumption is that by using the same regular expression we can also effectively identify vulnerability discovery strategies in bug reports. We evaluate this assumption using three measures that determine the performance of our regular expression-based approach, described below:


• Precision: Precision measures the proportion of bug reports predicted to include vulnerability discovery strategies that actually include them. We use Equation 2 to calculate precision.

  Precision = TP / (TP + FP)   (2)

• Recall: Recall measures the proportion of bug reports that include vulnerability discovery strategies and are correctly predicted by the prediction model. We use Equation 3 to calculate recall.

  Recall = TP / (TP + FN)   (3)

• F-Measure: F-Measure is the harmonic mean of precision and recall. An increase in precision often decreases recall, and vice-versa [25]. F-Measure provides a composite score of precision and recall, and is high only when both precision and recall are high. We use Equation 4 to calculate F-Measure.

  F-Measure = 2 × (Precision × Recall) / (Precision + Recall)   (4)

In Equations 2 and 3, TP, FP, and FN stand for true positive, false positive, and false negative, respectively.
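A minimal sketch of Equations 2-4 in Python, computed directly from the confusion counts defined above (the counts in the example are hypothetical):

    def precision(tp, fp):
        # Equation 2
        return tp / (tp + fp) if (tp + fp) else 0.0

    def recall(tp, fn):
        # Equation 3
        return tp / (tp + fn) if (tp + fn) else 0.0

    def f_measure(tp, fp, fn):
        # Equation 4: harmonic mean of precision and recall
        p, r = precision(tp, fp), recall(tp, fn)
        return 2 * p * r / (p + r) if (p + r) else 0.0

    # Hypothetical confusion counts for illustration only.
    print(round(precision(25, 130), 2), round(recall(25, 5), 2),
          round(f_measure(25, 130, 5), 2))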

Steps To Reproduce:
1. open Nightly with e10s enabled
2. open the attached HTML
3. select "THIS" text in the file
4. drag and drop the text to empty area of tabs bar

Figure 3: Example text pattern in a bug report that expresses a sequence of actions to discover a vulnerability ('CVE-2017-7812').

(step(s)?|how) to (reproduce|recreate|create|replicate)|
(step|repro|repro step|step to repro)|
follow(ing)? (scenario(s)?|step(s)?):|
^\d+\-+.+|^(\[\d+(\w+)?\]).+|
^(\(\d+(\w+)?\)).+|
^(\{\d+(\w+)?\}).+|
^step\d+ \:.*|^\d+(\.|\)) .*

Figure 4: Regular expression used to automatically identify bug reports with vulnerability discovery strategies.
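Assuming the pattern in Figure 4 is applied line by line in case-insensitive mode (our assumptions about flags and line handling), the detection loop could be sketched as:

    import re

    # Regular expression from Figure 4, derived from Chaparro et al. [8].
    STRATEGY_RE = re.compile(
        r"(step(s)?|how) to (reproduce|recreate|create|replicate)|"
        r"(step|repro|repro step|step to repro)|"
        r"follow(ing)? (scenario(s)?|step(s)?):|"
        r"^\d+\-+.+|^(\[\d+(\w+)?\]).+|"
        r"^(\(\d+(\w+)?\)).+|^(\{\d+(\w+)?\}).+|"
        r"^step\d+ \:.*|^\d+(\.|\)) .*",
        re.IGNORECASE | re.MULTILINE,
    )

    def has_discovery_strategy(report_text):
        """Predict that a bug report includes a vulnerability discovery
        strategy if any line of its text matches the pattern."""
        return STRATEGY_RE.search(report_text) is not None

    print(has_discovery_strategy("Steps To Reproduce:\n1. open Nightly"))  # True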

3.5.2 Machine Learning-based Approach. We hypothesize that by extracting text features and applying statistical learners we can build models that automatically predict which bug reports include vulnerability discovery strategies. We summarize the steps to construct prediction models in Figure 5, and describe them below.

Figure 5: Steps to construct prediction models to automatically identify the presence of vulnerability discovery strategies in bug reports: bug reports → text feature extraction → statistical learner → 10×10-fold validation and cross dataset evaluation.

Index  Text feature       Frequency
1      open               1
2      nightly            1
3      with               1
4      e10s               1
5      enabled            1
6      open nightly       1
7      nightly with       1
8      with e10s          1
9      e10s enabled       1
10     open nightly with  1
11     nightly with e10s  1
12     with e10s enabled  1

Figure 6: A hypothetical example to demonstrate the text feature extraction process described in Section 3.5.2. We extract uni-grams, bi-grams, and tri-grams from bug report content.

Step-1: Text feature extraction. For our machine learning-based approach we extract text features from the bug report comments. For text feature extraction we apply uni-grams, bi-grams, and tri-grams [21]. Uni-grams, bi-grams, and tri-grams are special cases of n-grams, i.e., contiguous sequences of n tokens from a given sample of text [21]; for uni-grams, bi-grams, and tri-grams, the value of n is respectively one, two, and three. Our assumption is that by capturing these sequences of tokens we may capture text features that can be used to predict the presence of vulnerability discovery strategies. In the case of uni-grams, we assume we capture special keywords that could be used to predict the presence of vulnerability discovery strategies in a bug report.
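A minimal sketch of this extraction with scikit-learn's CountVectorizer, which produces the kind of 12-feature matrix shown in the worked example that follows (the tokenization defaults are our assumptions):

    from sklearn.feature_extraction.text import CountVectorizer

    comments = ["open nightly with e10s enabled"]  # bug report text

    # Extract uni-grams, bi-grams, and tri-grams, as in Figure 6.
    vectorizer = CountVectorizer(ngram_range=(1, 3))
    feature_matrix = vectorizer.fit_transform(comments)

    print(vectorizer.get_feature_names_out())  # the 12 n-gram features
    print(feature_matrix.toarray())            # input to the learners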

We further demonstrate the process of text feature extraction using a hypothetical example: "open nightly with e10s enabled". The sentence is extracted from a Mozilla bug report, which maps to a vulnerability indexed by the NVD (CVE-2017-7812) [4]. With the application of our text feature extraction process we extract 12 features, indexed as 1-12 in Figure 6. Text features indexed as 1-5 are uni-grams, 6-9 are bi-grams, and 10-12 are tri-grams. From the 12 text features we construct a feature matrix, which is used as input to the statistical learners.

Step-2: Apply statistical learners. Researchers use statistical learners to build prediction models that learn from historic data and make prediction decisions on unseen data. We use the Scikit Learn API [31] to construct prediction models using statistical learners. We use six statistical learners, which we briefly describe, along with our reasons for selecting them, as follows:

• Classification and Regression Tree (CART): CART generates a tree based on an impurity measure, and uses that tree to provide decisions based on input features [3]. We select CART because this learner does not make any assumption about the distribution of features, and is robust to model overfitting [3, 36].
• K Nearest Neighbor (KNN): The KNN classification technique stores all available prediction outcomes based on training data, and classifies test data based on similarity measures. We select KNN because prior research has reported that defect prediction models that use KNN perform well [20].
• Logistic Regression (LR): LR estimates the probability that a data point belongs to a certain class, given the values of features [16]. LR provides good performance for classification if the features are roughly linear [16]. We select LR because this learner performs well for classification problems [16], such as defect prediction [33] and fault prediction [20].
• Support Vector Machine (SVM): Support vector machines predict labels by quantifying the amount of separation between multiple classes with respect to the features. We select SVM because prior research [10, 27] has reported SVMs to perform well for text-related classification.
• Random Forest (RF): RF is an ensemble technique that creates multiple classification trees, each of which is generated by taking random subsets of the training data [2, 36]. Unlike LR, RF does not expect features to be linear for good classification performance. Researchers [19] recommended the use of statistical learners that use ensemble techniques to build defect prediction models.

• Deep Neural Network (DNN): A deep neural network is a variant of an artificial neural network (ANN) with multiple hidden layers, which can vary from three to thousands of hidden layers [23]. Our implementation of DNN uses five parameters: first, we use a multi-layer perceptron with 5 hidden layers, where each layer is fully connected to the next one. Second, we use the backpropagation algorithm [39] for training. Third, we use the cross-entropy loss function for classification. Fourth, we use the rectified linear unit function as our activation function. Fifth, we train our model for 200 epochs.

Table 2: Selection of Bug Reports for Analysis

                               Gentoo  Libre  Mozilla
Initial count                  729     908    5,336
Criteria-1 (Keyword analysis)  566     89     1,151
Criteria-2 (Manual analysis)   9       32     536
Final bug report count         9       32     536
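A minimal sketch of the DNN configuration described in the list above, using scikit-learn's MLPClassifier; the hidden layer widths are our assumption, since the paper states only the count of layers, the activation, the loss, and the number of epochs.

    from sklearn.neural_network import MLPClassifier

    # 5 fully connected hidden layers with ReLU activation, trained by
    # backpropagation for 200 epochs; MLPClassifier uses a cross-entropy
    # (log) loss for classification. The layer widths (100) are assumed.
    dnn = MLPClassifier(
        hidden_layer_sizes=(100, 100, 100, 100, 100),
        activation="relu",
        max_iter=200,
    )
    # dnn.fit(feature_matrix, labels)  # labels: 1 = includes a strategy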

Prediction performance measures: As with our evaluation of the regular expression-based approach (Section 3.5.1), we use three measures to evaluate the prediction performance of the constructed models: precision, recall, and F-measure.

Step-3: Evaluation. We use two approaches to evaluate our constructed prediction models, which we describe below:

• 10×10-fold validation: We use 10×10-fold validation to evaluate our prediction models. We use this approach by randomly partitioning the dataset into 10 equal-sized subsamples or folds [36]. The performance of the constructed prediction models is tested by using 9 of the 10 folds as training data and the remaining fold as test data. Similar to prior research [19], we repeat the 10-fold validation 10 times to avoid prediction errors. We report the median prediction performance score of the 10 runs.

• Cross Dataset Prediction: For cross dataset prediction, we build prediction models by training the statistical learners on one dataset and use another dataset for testing. Along with 10×10-fold validation we use cross dataset evaluation because it can provide evidence on how generalizable the machine learning-based approach is, and whether there are similarities between datasets with respect to text features that are reflective of vulnerability discovery strategies. The three datasets yield a total of six train and test combinations. Similar to prior research [18], we repeat the cross dataset prediction procedure 10 times and report the median prediction performance score of the 10 runs.
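A minimal sketch of both evaluation approaches with scikit-learn; X and y stand for the n-gram feature matrix and the oracle labels, and the scoring choice is illustrative.

    import numpy as np
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    def ten_by_ten_fold(model, X, y):
        # Repeat 10-fold cross-validation 10 times; report the median.
        scores = []
        for run in range(10):
            cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=run)
            scores.extend(cross_val_score(model, X, y, cv=cv, scoring="recall"))
        return np.median(scores)

    def cross_dataset(model, X_train, y_train, X_test):
        # Train on one dataset (e.g., Mozilla) and test on another
        # (e.g., Gentoo); precision, recall, and F-measure are then
        # computed from the returned predictions.
        model.fit(X_train, y_train)
        return model.predict(X_test)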

4 EMPIRICAL FINDINGS
In this section we report our empirical findings.

4.1 Bug Database Mining
Using the Bugzilla API, we collect 15,010, 30,000, and 32,170 bug reports, respectively, for Gentoo Linux, LibreOffice, and Mozilla.

4.2 Bug Report Filtering
We collect 9, 32, and 536 bug reports that map to a CVE indexed in the NVD. In Table 2, we report a complete breakdown of how many bug reports are filtered using the criteria mentioned in Section 3.2.


Table 3: Agreement Level between Raters to Construct Oracle Dataset

Property        Gentoo       LibreOffice  Mozilla
Agreement       99.7%        98.1%        99.7%
Cohen's κ       0.67         0.74         0.68
Interpretation  Substantial  Substantial  Substantial

Table 4: Answer to RQ1: Bug Reports with Strategies to Discover Vulnerabilities

Metric                         Gentoo  Libre  Mozilla
Count                          9       29     425
Proportion of Bug Reports (%)  1.2     3.2    7.9

4.3 Qualitative Rating to Construct Strategy Oracle Dataset
We allocate two raters, the first and second authors of the paper, to conduct the qualitative rating. Both raters individually examined bug reports in all three datasets to identify which bug reports include strategies to discover vulnerabilities. We report summary statistics of the qualitative rating step in Table 3. The Cohen's Kappa is listed in the 'Cohen's κ' row. The 'Agreement' row reports the percentage of bug reports for which we observe agreement between the ratings of the first and second authors. For example, the two authors agreed on the rating for 99.7% of the bug reports for Gentoo, and the Cohen's κ is 0.67.

Our results show that the agreement between the raters varied from 98.1% to 99.7%. According to Landis and Koch [22], the Cohen's Kappa values indicate 'Substantial' agreement between the two raters. For the disagreements we use the last author as the resolver. Upon resolving disagreements we construct the oracle dataset that we use to answer RQ1 and RQ2.

4.4 Answer to RQ1: How frequently do vulnerability discovery strategies appear in bug reports?
We examine bug reports to see if the reports contain strategies used in vulnerability discovery. The proportion of bug reports that include vulnerability discovery strategies varies from 1.2% to 7.9%. Proportion-wise, the Mozilla dataset has the highest amount of bug reports that include vulnerability discovery strategies: 7.9% of the 5,336 bug reports downloaded from Mozilla map to a vulnerability, and also include the vulnerability discovery strategy in the bug report. We provide detailed answers in Table 4, where the 'Proportion of Bug Reports (%)' row shows the proportion of bug reports that include strategies to discover vulnerabilities. Based on our answer to RQ1 we conclude that bug reports can be used to identify strategies for vulnerability discovery.

4.5 Answer to RQ2: How can we automatically identify bug reports that include vulnerability discovery strategies?
We answer RQ2 in this section. We present the results of the regular expression-based approach in Table 5, where we report the precision, recall, and F-measure for the three datasets: Gentoo, Libre, and Mozilla. Considering recall, we observe the regular expression approach to perform best for the Mozilla dataset, with a recall of 0.83. Considering F-measure, LibreOffice performs best: for LibreOffice we observe an F-measure of 0.31. We do not observe precision to be >= 0.19 for any of the datasets, which suggests that regular expression-based approaches can generate many false positives.

Table 5: Answer to RQ2: Results for the Regular Expression Approach

Metric     Gentoo  Libre  Mozilla
Precision  0.14    0.19   0.16
Recall     0.37    0.79   0.83
F-measure  0.20    0.31   0.27

We present the findings of the machine learning approach with 10×10-fold validation in Table 6. The 'Learner' column lists the learner for which we report precision, recall, and F-measure for the three datasets: Gentoo (GENT), LibreOffice (LIBR), and Mozilla (MOZI). We observe LR to provide the highest F-measure for Mozilla. Precision is highest for Mozilla when CART is applied. The performance is worst for the Gentoo dataset: no learner is able to detect the presence of vulnerability discovery strategies in that dataset. For the Gentoo dataset, precision, recall, and F-measure are 0.0 for all six learners.

We report the cross dataset prediction results in Tables 7, 8, and 9. As described in Section 3.5.2, for cross dataset evaluation we construct a model with one dataset, and test the performance of the model using another dataset. Tables 7, 8, and 9 respectively describe the prediction results when we train models with the Gentoo, LibreOffice, and Mozilla datasets. We observe the prediction performance to be lowest when the model is trained with the Gentoo dataset. When trained with the Mozilla dataset, the precision, recall, and F-measure for the Gentoo dataset are respectively 1.00, 0.71, and 0.74, which are higher than those of the 10×10-fold validation. Our findings show that when trained with the Mozilla dataset the prediction performance is higher for the other datasets, implying that the text features extracted from the Mozilla dataset can be used to predict the presence of vulnerability discovery strategies in other datasets.

Table 6: Answer to RQ2: Results for Machine Learning-based Approach with 10×10-fold Validation

         Precision          Recall             F-measure
Learner  GENT  LIBR  MOZI   GENT  LIBR  MOZI   GENT  LIBR  MOZI
CART     0.00  0.67  0.90   0.00  0.67  0.71   0.00  0.67  0.79
DNN      0.00  0.00  0.82   0.00  0.00  0.77   0.00  0.00  0.79
KNN      0.00  0.00  0.64   0.00  0.00  0.11   0.00  0.00  0.18
LR       0.00  0.50  0.87   0.00  0.33  0.78   0.00  0.40  0.82
RF       0.00  0.00  0.00   0.00  0.00  0.00   0.00  0.00  0.00
SVM      0.00  0.00  0.00   0.00  0.00  0.00   0.00  0.00  0.00

Table 7: Answer to RQ2: Results for Machine Learning-based Approach with Cross-dataset Evaluation. Training dataset is Gentoo.

         Precision    Recall       F-measure
Learner  LIBR  MOZI   LIBR  MOZI   LIBR  MOZI
CART     0.00  0.00   0.00  0.00   0.00  0.00
DNN      0.00  0.26   0.00  0.11   0.00  0.15
KNN      0.00  0.00   0.00  0.00   0.00  0.00
LR       0.00  0.12   0.00  0.00   0.00  0.00
RF       0.00  0.00   0.00  0.00   0.00  0.00
SVM      0.00  0.00   0.00  0.00   0.00  0.00

Table 8: Answer to RQ2: Results for Machine Learning-based Approach with Cross-dataset Evaluation. Training dataset is LibreOffice.

         Precision    Recall       F-measure
Learner  GENT  MOZI   GENT  MOZI   GENT  MOZI
CART     0.00  0.53   0.00  0.65   0.00  0.59
DNN      0.00  0.59   0.00  0.09   0.00  0.15
KNN      0.00  0.00   0.00  0.00   0.00  0.00
LR       0.00  0.50   0.00  0.04   0.00  0.07
RF       0.00  0.00   0.00  0.00   0.00  0.00
SVM      0.00  0.00   0.00  0.00   0.00  0.00

Table 9: Answer to RQ2: Results for Machine Learning-based Approach with Cross-dataset Evaluation. Training dataset is Mozilla.

         Precision    Recall       F-measure
Learner  GENT  LIBR   GENT  LIBR   GENT  LIBR
CART     0.77  0.67   0.71  0.07   0.74  0.12
DNN      0.57  0.71   0.71  0.17   0.63  0.27
KNN      1.00  0.00   0.25  0.00   0.40  0.00
LR       0.83  0.70   0.64  0.24   0.36  0.73
RF       0.00  0.00   0.02  0.00   0.04  0.00
SVM      1.00  0.00   0.00  0.00   0.00  0.00

5 DISCUSSION
We discuss the findings of our paper in this section.

5.1 On the Value of Bug Reports to Advance Understanding of Vulnerabilities
Results from Table 4 show that OSS bug reports contain vulnerability discovery strategies. Our reported frequency of vulnerability discovery strategies is lower than the frequency of bug discovery strategies reported in prior research [8]. Chaparro et al. [8] studied 2,912 bug reports and observed 51.9% of the studied bug reports to include strategies on how a bug was discovered.

5.2 Reporting of Vulnerability Discovery Strategies in Bug Reports
As shown in Table 4, the proportion of bug reports that include vulnerability discovery strategies varies from 1.2% to 7.9%. One possible explanation can be attributed to a lack of context in how a rater perceives the content of the bug report. The description of the discovery strategy might make more sense to the person who performed the discovery, but due to a lack of context, a rater who is well-versed in software security may not understand the steps of actions needed to perform the discovery. A set of guidelines on proper reporting of vulnerability discovery strategies can be helpful for security researchers who do not actively contribute to OSS projects, but need to report vulnerabilities.

Lack of context to reproduce a bug is common in software development. In 2016, developers who host their projects on GitHub signed and sent a petition to GitHub, in which they stated that bug reports "... are often filed missing crucial information like reproduction steps"^12. Researchers [15, 41, 42] have reported the negative impact of incomplete bug reports. For example, non-reproducible bugs [15], unfixed bugs [41], and delayed bug resolution [42] are often attributed to incomplete information in bug descriptions.

12. https://github.com/dear-github/dear-github

5.3 Automation
Our findings reported in Section 4.5 provide evidence that the two automation strategies we used have limitations. As shown in Table 5, using regular expressions, the highest F-measure is 0.31, which is subject to improvement. With 10×10-fold validation the highest F-measure is 0.82, but 0.00 for Gentoo. Furthermore, from Section 4.5 we observe that when building models with the Mozilla dataset, the prediction performance is relatively higher for cross-dataset prediction. One possible explanation is that the text patterns that appear in the Mozilla dataset are capable of separating bug reports that include vulnerability discovery strategies from bug reports that do not. Our findings suggest that automated detection of vulnerability discovery strategies in bug reports is tied to the textual content presented in the bug reports.

5.4 Future Research
Findings from our paper provide opportunities to pursue the following research directions:

• Strategy Mining: Researchers can use bug reports to identify what strategies are used to discover vulnerabilities. A synthesis of vulnerability discovery strategies can be helpful for IT organizations that do not have security experts on their team, but are interested in discovering and repairing software vulnerabilities before release to end-users. Munaiah et al. [28] stressed the importance of identifying vulnerability discovery strategies from available software artifacts: "As the complexity of the software system increases, speculating the different ways in which an attacker could exploit a vulnerability becomes a daunting task. The key, therefore, is to leverage empirical information to characterize the typical ways in which attackers tend to discover and exploit vulnerabilities."
While pursuing this research direction, researchers may benefit from applying qualitative analysis with multiple raters, as automated techniques, such as regular expressions and text feature-based prediction models, are limited with respect to detection accuracy. In addition, researchers can compare and contrast the findings obtained from bug reports with those obtained from other data sources, such as capture the flag (CTF) competitions. Recently, researchers [28] have mined CTF data and observed that CTF participants discover vulnerabilities in a sequential manner, which is similar to our definition of vulnerability discovery strategy. Also, researchers can investigate who performs the discovery strategies, their experience in vulnerability discovery, and how difficult it is to perform the discovery. Our findings provide the foundation for further research in the domain of vulnerability understanding and characterization.

• Improving Automated Detection Strategies: We have discussed how our automated approaches to detect vulnerability discovery strategies are limited with respect to detection accuracy. We advocate for future research that investigates if other text mining techniques, such as topic modeling [38], word2vec [26], and sequential text pattern mining [24], can effectively detect the presence of vulnerability discovery strategies. Currently, our automated approach does not show prediction performance that is consistent across all datasets. For example, for the Gentoo dataset, both the machine learning approach and the regular expression-based approach showed lower detection accuracy compared to Mozilla and LibreOffice. Apart from text mining and regular expressions, researchers can investigate what other techniques can be used to build a universal classifier that is capable of detecting the presence of vulnerability discovery strategies for multiple datasets collected from a wide range of domains, such as browsers (Mozilla) and operating systems (Gentoo).

• Towards Better Reporting of Vulnerability Discovery: Our findings show evidence that vulnerability discovery strategies may not be adequately reported in bug reports. For OSS projects, practitioners who report vulnerability discovery strategies might be peripheral contributors [34], i.e., contributors who do not regularly engage in software development and maintenance tasks, and may not be aware of the practices on how to report vulnerability discovery strategies. Concrete guidelines could be helpful for proper reporting of vulnerability discovery strategies.

6 THREATS TO VALIDITY
We discuss the limitations of our paper as follows:

• Conclusion Validity: All of our findings are derived from three datasets, which are constructed by two raters. The data construction process is susceptible to rater judgment. The raters' experience can bias the process of identifying which bug reports include strategies to discover vulnerabilities. We mitigate this limitation by assigning at least two raters. We also use a resolver to resolve disagreements between the two raters. Prior work [17, 37] has underlined the importance of tuning the parameters of statistical learners to get better prediction performance. While building prediction models, we did not tune the statistical learners. For CART, KNN, LR, RF, and SVM we rely on the default parameters provided by the Scikit Learn API [31]; for DNN, the parameters are determined by the first author's judgment (a tuning sketch follows these threats). We acknowledge that not tuning the parameters of statistical learners might influence the prediction performance of the constructed models.

• Construct Validity: We use raters to construct the oracle dataset. Our process is susceptible to mono-method bias, where the subjective judgment of raters can influence the findings. We mitigate this threat by using two raters and one resolver.

• External Validity: Our results are subject to external validity, as we used three datasets collected from the OSS domain. Findings from our paper may be limited and may not generalize to other datasets. We mitigate this limitation by using datasets from three organizations that deliver a variety of software products to end-users.

• Internal Validity: From the bug reports, we extract comments and descriptions to collect the text necessary to determine if a bug report includes strategies to discover vulnerabilities. The descriptions and comments available in the bug reports may not be comprehensive enough to provide the context needed to determine if a strategy is present. We rely on CVEs reported in the bug reports to identify those that are related to vulnerabilities (a CVE extraction sketch follows this list). The bug reports may include discussion of vulnerabilities that are not confirmed and indexed in NVD. Our analysis may miss latent and unconfirmed vulnerabilities, which could influence the results.
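The following is a minimal sketch of the parameter tuning the conclusion validity threat refers to, using grid search over a random forest with the Scikit Learn API [31]. The placeholder features and the parameter grid are illustrative assumptions, not the configuration used in our study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic features standing in for text features mined from bug reports.
X, y = make_classification(n_samples=200, n_features=20, random_state=42)

# Search a small, illustrative grid, tuning for recall, the metric we report.
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [None, 10, 20]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="recall",
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```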
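The following is a hedged sketch of how CVE identifiers could be extracted from bug report text to select vulnerability-related reports, as described in the internal validity threat. The pattern follows the public CVE naming scheme; the sample text is an illustrative assumption.

```python
import re

# CVE identifiers follow the scheme CVE-YYYY-NNNN, with four or more
# digits in the sequence number.
CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,}", re.IGNORECASE)

def extract_cves(report_text: str) -> list:
    """Return the unique CVE identifiers mentioned in a bug report."""
    return sorted({match.upper() for match in CVE_PATTERN.findall(report_text)})

print(extract_cves("Fixed in 68.0.1; see CVE-2019-11708 and cve-2019-11707."))
# ['CVE-2019-11707', 'CVE-2019-11708']
```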

7 CONCLUSION
Identifying and characterizing the nature of vulnerabilities has gained a lot of interest amongst researchers. However, data sources to study vulnerabilities can be limited. One approach to mitigate this limitation is to investigate if bug reports include descriptions of how vulnerabilities are discovered. We hypothesized that bug reports could include such descriptions, which could help researchers further identify what strategies practitioners execute to discover vulnerabilities in OSS. We evaluated our hypothesis by conducting an empirical analysis of 729, 908, and 5336 bug reports, respectively collected from Gentoo, LibreOffice, and Mozilla.

We observe OSS bug reports to include descriptions of how vulnerabilities are discovered: the proportion of bug reports that included vulnerability discovery strategies varied from 1.2% to 7.9%. Based on our answer to RQ1, we conclude that software bug reports include information that can be used to identify strategies for vulnerability discovery. Our findings also suggest that automated detection of vulnerability discovery strategies is dependent on the text features that we mine from bug reports, and is sensitive to which dataset is being used. For example, we observe the highest detection performance for Mozilla using the machine learning approach, and the highest detection performance for LibreOffice using the regular expression-based approach. Based on our empirical study, we recommend that researchers apply qualitative analysis to identify vulnerability discovery strategies, as automated approaches may be limited in detecting the presence of vulnerability discovery strategies in OSS bug reports. We hope our paper will advance research in the area focused on characterizing vulnerabilities in OSS projects.

ACKNOWLEDGMENTS
We thank the members of the PASER group at Tennessee Technological University for their valuable feedback on the paper.


REFERENCES
[1] N. Ali, A. Sabané, Y. Guéhéneuc, and G. Antoniol. 2012. Improving Bug Location Using Binary Class Relationships. In 2012 IEEE 12th International Working Conference on Source Code Analysis and Manipulation. 174–183. https://doi.org/10.1109/SCAM.2012.26
[2] Leo Breiman. 2001. Random Forests. Machine Learning 45, 1 (2001), 5–32. https://doi.org/10.1023/A:1010933404324
[3] Leo Breiman et al. 1984. Classification and Regression Trees (1st ed.). Chapman & Hall, New York. 358 pages. http://www.crcpress.com/catalog/C4841.htm
[4] Bugzilla. 2017. Web content can open local files by hooking drag and drop to outside of content. https://bugzilla.mozilla.org/show_bug.cgi?id=1379842. [Online; accessed 21-December-2019].
[5] Oscar Chaparro, Carlos Bernal-Cardenas, Jing Lu, Kevin Moran, Andrian Marcus, Massimiliano Di Penta, Denys Poshyvanyk, and Vincent Ng. 2019. Assessing the Quality of the Steps to Reproduce in Bug Reports. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2019). ACM, New York, NY, USA, 86–96. https://doi.org/10.1145/3338906.3338947
[6] O. Chaparro, J. M. Florez, and A. Marcus. 2016. On the Vocabulary Agreement in Software Issue Descriptions. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME). 448–452. https://doi.org/10.1109/ICSME.2016.44
[7] Oscar Chaparro, Juan Manuel Florez, and Andrian Marcus. 2019. Using bug descriptions to reformulate queries during text-retrieval-based bug localization. Empirical Software Engineering 24, 5 (01 Oct 2019), 2947–3007. https://doi.org/10.1007/s10664-018-9672-z
[8] Oscar Chaparro, Jing Lu, Fiorella Zampetti, Laura Moreno, Massimiliano Di Penta, Andrian Marcus, Gabriele Bavota, and Vincent Ng. 2017. Detecting Missing Information in Bug Descriptions. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2017). ACM, New York, NY, USA, 396–407. https://doi.org/10.1145/3106237.3106285
[9] Jacob Cohen. 1960. A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement 20, 1 (1960), 37–46. https://doi.org/10.1177/001316446002000104
[10] Fabrice Colas and Pavel Brazdil. 2006. Comparison of SVM and Some Older Classification Algorithms in Text Classification Tasks. In Artificial Intelligence in Theory and Practice, Max Bramer (Ed.). Springer US, Boston, MA, 169–178.
[11] National Vulnerability Database. 2019. NVD-Vulnerabilities. https://nvd.nist.gov/vuln. [Online; accessed 22-August-2019].
[12] Tezcan Dilshener, Michel Wermelinger, and Yijun Yu. 2016. Locating Bugs Without Looking Back. In Proceedings of the 13th International Conference on Mining Software Repositories (MSR '16). ACM, New York, NY, USA, 286–290. https://doi.org/10.1145/2901739.2901775
[13] PC Doctor Inc. 2019. PC Diagnostic & System Information Solutions Pre-installed on PC/Android Systems. https://www.pc-doctor.com/solutions/oems. [Online; accessed 14-Nov-2019].
[14] DZone. 2019. Millions of Dell PCs Vulnerable to Flaw in Third-Party Component. https://threatpost.com/millions-of-dell-pcs-vulnerable-to-flaw-in-third-party-component/145833/. [Online; accessed 08-Aug-2019].
[15] Mona Erfani Joorabchi, Mehdi Mirzaaghaei, and Ali Mesbah. 2014. Works for Me! Characterizing Non-reproducible Bug Reports. In Proceedings of the 11th Working Conference on Mining Software Repositories (MSR 2014). ACM, New York, NY, USA, 62–71. https://doi.org/10.1145/2597073.2597098
[16] David Freedman. 2005. Statistical Models: Theory and Practice. Cambridge University Press.
[17] Wei Fu, Tim Menzies, and Xipeng Shen. 2016. Tuning for software analytics: Is it really necessary? Information and Software Technology 76 (2016), 135–146. https://doi.org/10.1016/j.infsof.2016.04.017
[18] Takafumi Fukushima, Yasutaka Kamei, Shane McIntosh, Kazuhiro Yamashita, and Naoyasu Ubayashi. 2014. An Empirical Study of Just-in-time Defect Prediction Using Cross-project Models. In Proceedings of the 11th Working Conference on Mining Software Repositories (MSR 2014). ACM, New York, NY, USA, 172–181. https://doi.org/10.1145/2597073.2597075
[19] Baljinder Ghotra, Shane McIntosh, and Ahmed E. Hassan. 2015. Revisiting the Impact of Classification Techniques on the Performance of Defect Prediction Models. In Proceedings of the 37th International Conference on Software Engineering - Volume 1 (ICSE '15). IEEE Press, Piscataway, NJ, USA, 789–800. http://dl.acm.org/citation.cfm?id=2818754.2818850
[20] T. Hall, S. Beecham, D. Bowes, D. Gray, and S. Counsell. 2012. A Systematic Literature Review on Fault Prediction Performance in Software Engineering. IEEE Transactions on Software Engineering 38, 6 (Nov 2012), 1276–1304. https://doi.org/10.1109/TSE.2011.103
[21] Daniel Jurafsky and James H. Martin. 2009. Speech and Language Processing (2nd Edition). Prentice-Hall, Inc., Upper Saddle River, NJ, USA.
[22] J. Richard Landis and Gary G. Koch. 1977. The Measurement of Observer Agreement for Categorical Data. Biometrics 33, 1 (1977), 159–174. http://www.jstor.org/stable/2529310
[23] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521, 7553 (2015), 436–444.
[24] David D. Lewis and William A. Gale. 1994. A sequential algorithm for training text classifiers. In SIGIR '94. Springer, 3–12.
[25] Tim Menzies, Alex Dekhtyar, Justin Distefano, and Jeremy Greenwald. 2007. Problems with Precision: A Response to "Comments on 'Data Mining Static Code Attributes to Learn Defect Predictors'". IEEE Trans. Softw. Eng. 33, 9 (Sept. 2007), 637–640. https://doi.org/10.1109/TSE.2007.70721
[26] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 3111–3119. http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
[27] Vikramjit Mitra, Chia-Jiu Wang, and Satarupa Banerjee. 2007. Text Classification: A Least Square Support Vector Machine Approach. Appl. Soft Comput. 7, 3 (June 2007), 908–914. https://doi.org/10.1016/j.asoc.2006.04.002
[28] N. Munaiah, A. Rahman, J. Pelletier, L. Williams, and A. Meneely. 2019. Characterizing Attacker Behavior in a Cybersecurity Penetration Testing Competition. In 2019 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). 1–6. https://doi.org/10.1109/ESEM.2019.8870147
[29] Department of Homeland Security. 2019. Information Technology Sector. https://www.dhs.gov/cisa/information-technology-sector. [Online; accessed 15-Nov-2019].
[30] Science of Security and Privacy. 2010. Cyber Security – Is Science Possible? https://cps-vo.org/node/624. [Online; accessed 24-December-2019].
[31] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 12 (Nov. 2011), 2825–2830. http://dl.acm.org/citation.cfm?id=1953048.2078195
[32] F. Peters, T. T. Tun, Y. Yu, and B. Nuseibeh. 2019. Text Filtering and Ranking for Security Bug Report Prediction. IEEE Transactions on Software Engineering 45, 6 (June 2019), 615–631. https://doi.org/10.1109/TSE.2017.2787653
[33] Foyzur Rahman and Premkumar Devanbu. 2013. How, and Why, Process Metrics Are Better. In Proceedings of the 2013 International Conference on Software Engineering (ICSE '13). IEEE Press, Piscataway, NJ, USA, 432–441. http://dl.acm.org/citation.cfm?id=2486788.2486846
[34] Pankaj Setia, Balaji Rajagopalan, Vallabh Sambamurthy, and Roger Calantone. 2012. How peripheral developers contribute to open-source software development. Information Systems Research 23, 1 (2012), 144–163.
[35] J. D. Strate and P. A. Laplante. 2013. A Literature Review of Research in Software Defect Reporting. IEEE Transactions on Reliability 62, 2 (June 2013), 444–454. https://doi.org/10.1109/TR.2013.2259204
[36] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. 2005. Introduction to Data Mining (First Edition). Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.
[37] Chakkrit Tantithamthavorn, Shane McIntosh, Ahmed E. Hassan, and Kenichi Matsumoto. 2016. Automated Parameter Optimization of Classification Techniques for Defect Prediction Models. In Proceedings of the 38th International Conference on Software Engineering (ICSE '16). ACM, New York, NY, USA, 321–332. https://doi.org/10.1145/2884781.2884857
[38] Hanna M. Wallach. 2006. Topic modeling: beyond bag-of-words. In Proceedings of the 23rd International Conference on Machine Learning. ACM, 977–984.
[39] Bernard Widrow and Michael A. Lehr. 1990. 30 years of adaptive neural networks: perceptron, madaline, and backpropagation. Proc. IEEE 78, 9 (1990), 1415–1442.
[40] Yu Zhao, Kye Miller, Tingting Yu, Wei Zheng, and Minchao Pu. 2019. Automatically Extracting Bug Reproducing Steps from Android Bug Reports. In Reuse in the Big Data Era, Xin Peng, Apostolos Ampatzoglou, and Tanmay Bhowmik (Eds.). Springer International Publishing, Cham, 100–111.
[41] Thomas Zimmermann, Nachiappan Nagappan, Philip J. Guo, and Brendan Murphy. 2012. Characterizing and Predicting Which Bugs Get Reopened. In Proceedings of the 34th International Conference on Software Engineering (ICSE '12). IEEE Press, Piscataway, NJ, USA, 1074–1083. http://dl.acm.org/citation.cfm?id=2337223.2337363
[42] T. Zimmermann, R. Premraj, N. Bettenburg, S. Just, A. Schroter, and C. Weiss. 2010. What Makes a Good Bug Report? IEEE Transactions on Software Engineering 36, 5 (Sep. 2010), 618–643. https://doi.org/10.1109/TSE.2010.63
