MINING STACK OVERFLOW
AN ACCOUNT OF EXPLORATORY RESEARCH
Created by Chris Foster / @chrisfosterelli

MINING STACK OVERFLOW - Chris Foster · 2020. 7. 6. · Development Emails Content Analyzer: Intention Mining in Developer Discussions. KEY FINDINGS 1. Mailing lists are less important

RESEARCH GOALS

1. Explore techniques for mining text and web data
2. Understand how to apply learning algorithms
3. Understand how to evaluate feature performance
4. Develop a program to automate record analysis
5. Compare and contrast literature work with results

INITIAL RESEARCH

We started with a literature review of nine papers:

A Preliminary Psychometric Text Analysis of the Apache Developer Mailing List
Techniques for Identifying the Country Origin of Mailing List Participants
An Analysis of Newbies' First Interactions on Project Mailing Lists
Content Classification of Developer Emails
Communication in Open Source Software Development Mailing Lists
Reviewer Recommendation to Expedite Crowd Collaboration
Mining Developer Mailing List to Predict Software Defects
Automatically Prioritizing Pull Requests
Development Emails Content Analyzer: Intention Mining in Developer Discussions

KEY FINDINGS

1. Mailing lists are less important today than previously
2. Many papers had low or very low accuracy rates
3. Many papers had no clear practical application

We wanted to find something with:

A clear, useful application
Substantial room for future work
Datasets other than traditional mailing lists

We added two new papers to the list:

Predicting Response Time in Stack Overflow
Improving Low Quality Stack Overflow Post Detection

The first step is to replicate the work done in the original paper

SOFTWARE USED

Python
NumPy
SciPy
scikit-learn

DATA SOURCE

Stack Overflow provides public data dumps
Expanded dataset is a 39GB XML file
This saved us substantial time!

INITIAL REPLICATION: FILTERING

We had to reduce the dataset to the specified period
Only posts between May 1st and August 1st, 2014
Only posts that are a question or answer
Converts post format from XML into JSON
Reduces the dataset to 1,307,172 posts
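
A minimal sketch of how a filtering pass like this could look, not the code used in the project. It assumes the dump's Posts file stores each post as a `<row>` element with `PostTypeId` (1 = question, 2 = answer) and an ISO-formatted `CreationDate` attribute. The function name `filter_posts`, the exclusive end date, and the tiny inline sample are illustrative assumptions.

```python
import io
import json
import xml.etree.ElementTree as ET
from datetime import datetime

# Window from the slide; treating the end date as exclusive is an assumption.
START = datetime(2014, 5, 1)
END = datetime(2014, 8, 1)
KEEP_TYPES = {"1", "2"}  # PostTypeId: 1 = question, 2 = answer


def filter_posts(xml_source, start=START, end=END):
    """Stream <row> elements and yield those inside the window as dicts."""
    context = ET.iterparse(xml_source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event != "end" or elem.tag != "row":
            continue
        created = datetime.fromisoformat(elem.get("CreationDate"))
        if elem.get("PostTypeId") in KEEP_TYPES and start <= created < end:
            yield dict(elem.attrib)
        # Detach rows already processed so a multi-GB file is never held in memory.
        root.clear()


# Tiny hypothetical sample standing in for the 39GB dump.
sample = io.BytesIO(b"""<posts>
  <row Id="1" PostTypeId="1" CreationDate="2014-05-02T10:00:00.000" />
  <row Id="2" PostTypeId="2" CreationDate="2014-05-02T11:30:00.000" ParentId="1" />
  <row Id="3" PostTypeId="1" CreationDate="2014-09-01T09:00:00.000" />
  <row Id="4" PostTypeId="5" CreationDate="2014-06-01T09:00:00.000" />
</posts>""")

kept = list(filter_posts(sample))
for post in kept:
    print(json.dumps(post))  # one JSON object per line
```

Streaming with `iterparse` rather than loading the whole tree is the design choice that matters here: a file of this size will not fit in memory on most machines.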

INITIAL REPLICATION: GENERATION

The next step is to generate tag-based features
At this point we have to filter out "unpopular tags" [1]
The authors use three key features: RSR [2], ASR [3], and PR [4]

1: 15 unique contributors for that tag
2: users with avg response time below 2hr for that tag / total users
3: users with at least 10 answers for that tag / total users
4: number of tag occurrences / total tag occurrences

We generate these using internal maps and caches
The CSV output contains 266,482 questions
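
The three ratios above can be sketched with plain dictionaries. This is an illustration under assumptions, not the project's generator: "total users" is read as the users who answered in that tag, tags below the contributor threshold are dropped, and the function name, input shapes and sample data are hypothetical.

```python
from collections import Counter, defaultdict


def tag_features(questions, answers, min_contributors=15,
                 fast_cutoff_s=2 * 3600, min_answers=10):
    """Compute RSR, ASR and PR per tag.

    questions: list of tag lists, one per question
    answers:   list of (user_id, tags, response_seconds) tuples
    """
    tag_counts = Counter(tag for tags in questions for tag in tags)
    total_occurrences = sum(tag_counts.values()) or 1  # avoid dividing by zero

    # tag -> user -> response times of that user's answers in the tag
    times = defaultdict(lambda: defaultdict(list))
    for user, tags, seconds in answers:
        for tag in tags:
            times[tag][user].append(seconds)

    features = {}
    for tag, per_user in times.items():
        n_users = len(per_user)
        if n_users < min_contributors:
            continue  # treated as an "unpopular" tag (assumed reading of note 1)
        responsive = sum(1 for t in per_user.values()
                         if sum(t) / len(t) < fast_cutoff_s)
        active = sum(1 for t in per_user.values() if len(t) >= min_answers)
        features[tag] = {
            "RSR": responsive / n_users,
            "ASR": active / n_users,
            "PR": tag_counts[tag] / total_occurrences,
        }
    return features


# Hypothetical toy data; thresholds are lowered so the example produces output.
questions = [["python"], ["python", "numpy"], ["java"]]
answers = [
    ("alice", ["python"], 600),
    ("alice", ["python", "numpy"], 1800),
    ("bob", ["python"], 5 * 3600),
    ("carol", ["java"], 900),
]
feats = tag_features(questions, answers, min_contributors=2, min_answers=2)
print(feats)
```

The three thresholds are keyword arguments because they are the quantities varied later in Experiment 1.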

INITIAL REPLICATION: ANALYSIS

K-Means clustering to group response times
K-Nearest-Neighbours classification engine
K-Fold cross-validation to check accuracy
Parameters: 25 bins, 10 neighbours, and 10 folds

We could successfully reproduce the paper's success rate!
Success rate: 32.4%
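
A rough sketch of how the three steps fit together in scikit-learn, using the slide's parameters (25 bins, 10 neighbours, 10 folds). The data is random and only stands in for the real feature CSV, so the printed accuracy means nothing. Clustering the raw response times in one dimension, and shuffling the folds, are assumptions about details the slides do not state.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in: three feature columns (think RSR, ASR, PR per question)
# and one response time in seconds per question.
X = rng.random((1000, 3))
response_s = rng.exponential(scale=3600.0, size=1000)

# Step 1: K-Means on the response times turns a continuous target
# into (up to) 25 discrete class labels.
kmeans = KMeans(n_clusters=25, n_init=10, random_state=0)
bins = kmeans.fit_predict(response_s.reshape(-1, 1))

# Step 2: a 10-nearest-neighbours classifier predicts the bin from the features.
knn = KNeighborsClassifier(n_neighbors=10)

# Step 3: 10-fold cross-validation estimates the success rate (accuracy).
folds = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(knn, X, bins, cv=folds)
print(f"mean accuracy over {len(scores)} folds: {scores.mean():.3f}")
```

Fixing `random_state` matters for replication: K-Means starts from random centres, so unseeded runs can give slightly different bins and slightly different accuracy.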

So how can we improve this?

EXPERIMENT 1

Features use many "magic numbers"
Can we vary unique contributor count? (X)
Can we vary the avg. 2hr response cutoff? (Y)
Can we vary the minimum answer count? (Z)
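
One way to organise a sweep over X, Y and Z is a small grid driver that repeats each setting over several random seeds, so that seed-to-seed noise can be told apart from a real effect. Everything below is a hypothetical sketch: the grid values other than the defaults (15, 2 hours, 10) are made up, and `placeholder_evaluate` returns a meaningless number where a real run would rebuild the features and return cross-validated accuracy.

```python
import itertools
import statistics


def sweep(evaluate, grid, seeds=(0, 1, 2)):
    """Call evaluate(seed=..., **params) for every grid point and seed.

    Returns a list of (params, mean_score, stdev_score) tuples.
    """
    names = list(grid)
    rows = []
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [evaluate(seed=seed, **params) for seed in seeds]
        rows.append((params, statistics.mean(scores), statistics.stdev(scores)))
    return rows


# Hypothetical value ranges around the defaults used in the replication.
GRID = {
    "min_contributors": [5, 15, 30],   # X
    "fast_cutoff_hours": [1, 2, 4],    # Y
    "min_answers": [5, 10, 20],        # Z
}


def placeholder_evaluate(min_contributors, fast_cutoff_hours, min_answers, seed):
    """Stand-in only: the return value is NOT a measurement."""
    return 0.01 * seed


rows = sweep(placeholder_evaluate, GRID)
for params, mean, spread in rows[:3]:
    print(params, round(mean, 3), round(spread, 3))
```

Comparing the difference between grid points against the per-point standard deviation gives a quick check on whether a change is larger than the randomisation noise.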

RESULTS

INTERPRETATION

Results indicate little variation in success
No predictable pattern indicated by changing values
Small changes accounted for by KMeans randomization
Other parameters are likely "making up" for failures

EXPERIMENT 2

Do these results generalize to longer time periods?
Do these results generalize to other time periods?

RESULTS

INTERPRETATION

Results can generalize to different years
Results can generalize to larger datasets
Extending the dataset may slightly improve accuracy

EXPERIMENT 3

Do Neural Networks improve the success rate?
Do Support Vector Machines improve the success rate?
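
A comparison like this can be set up by running each classifier through the same cross-validation folds and timing it. The sketch below uses random stand-in data and assumed hyperparameters (one hidden layer of 32 units, an RBF kernel); the slides do not say which network or SVM settings were used, so the printed numbers carry no meaning.

```python
import time

import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature table and the 25 response-time bins.
X = rng.random((600, 3))
y = rng.integers(0, 25, size=600)

# Hyperparameters here are illustrative assumptions, not the project's settings.
models = {
    "knn": KNeighborsClassifier(n_neighbors=10),
    "mlp": MLPClassifier(hidden_layer_sizes=(32,), max_iter=200, random_state=0),
    "svm": SVC(kernel="rbf"),
}

# Same folds for every model so the accuracies are comparable.
folds = KFold(n_splits=10, shuffle=True, random_state=0)

report = {}
for name, model in models.items():
    started = time.perf_counter()
    scores = cross_val_score(model, X, y, cv=folds)
    elapsed = time.perf_counter() - started
    report[name] = (scores.mean(), elapsed)
    print(f"{name}: accuracy {scores.mean():.3f}, {elapsed:.2f}s")
```

Recording wall-clock time next to accuracy makes the trade-off visible: a slower model is only worth it if the accuracy gain is clearly larger than fold-to-fold noise.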

RESULTS

INTERPRETATION

Both algorithms appear to only slightly increase accuracy
Both algorithms also take substantially longer to run
Both algorithms run with identical performance
This suggests that features are the problem

CHALLENGES

Working with very large files
Replicating vague original work
Finding an area to focus on

FUTURE WORK

Experiment on value ranges with feature isolation
Introduce new features for higher accuracy
Apply this technique to other datasets

CONCLUSION

In this research we were able to:

Compare and contrast current literature
Explore techniques for data mining Stack Overflow
Apply learning algorithms for response time prediction
Evaluate feature performance of existing algorithm
Evaluate different time ranges on existing algorithm
Evaluate different algorithms on the dataset

THE END

Chris Foster
Thompson Rivers University