EXPERIMENTING WITH TEXT CLASSIFICATION ALGORITHMS IN NEWS ARTICLES: SVM VS. NAIVE BAYESIAN ALGORITHM
NUHI BESIMI, ADRIAN BESIMI, VISAR SHEHU
DAAD: 15TH WORKSHOP “SOFTWARE ENGINEERING EDUCATION AND REVERSE ENGINEERING”, BOHINJ, SLOVENIA



Content

- Collected Data
- Data Pre-processing
- The Naïve Bayes Classifier
- SVM (Support Vector Machine)
- Experiment and Evaluation
  - Accuracy
  - Execution Time
- Future work


Collected Data

Sources:
- CNET – http://cnet.com
- PCWorld – http://pcworld.com
- TechCrunch – http://techcrunch.com
- NyTimes – http://nytimes.com
- Goal – http://goal.com

Categories:
- Politics
- Technology
- Sports


Collected Data (summary)

News articles per category:

                 Politics     Technology   Sports       Total
Training Data    200 (80 %)   345 (80 %)   409 (80 %)   954
Testing Data     49 (20 %)    86 (20 %)    102 (20 %)   237
Total            249          431          511          1191

Number of collected documents (news articles) per source:

CNET   PCWorld   TechCrunch   NyTimes   Goal
81     229       121          570       190


Data Pre-processing

Data Cleaning:
- Stop-word removal
- Stemming (Porter Algorithm)
- Low term frequency filtering (count < 3)

Data Transformation:
- Bag of words model (vector representation)
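The pre-processing steps listed on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the stop-word list is a tiny hypothetical one, `crude_stem` is a simple suffix stripper standing in for the Porter algorithm, and the low-frequency filter is assumed to apply to a term's total count across the corpus (the slides do not say whether the threshold is per document or per corpus). All function names are made up for this example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "is", "are", "for", "with"}

def crude_stem(word):
    """Stand-in for the Porter stemmer: strips a few common English suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Lowercase, split on non-letters, drop stop words, then stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]

def build_vocabulary(documents, min_count=3):
    """Keep terms whose total corpus count is at least min_count (drops count < 3 by default)."""
    totals = Counter()
    for doc in documents:
        totals.update(tokenize(doc))
    return sorted(term for term, count in totals.items() if count >= min_count)

def bag_of_words(text, vocabulary):
    """Represent one document as a vector of term counts over the vocabulary."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]

docs = [
    "The team scored and the team won",
    "Voters elected the new team of ministers",
    "Scoring goals wins matches",
]
# The toy corpus is too small for the slide's threshold of 3, so 2 is used here.
vocab = build_vocabulary(docs, min_count=2)
vectors = [bag_of_words(d, vocab) for d in docs]
print(vocab)
print(vectors)
```

Each document ends up as a fixed-length count vector over the shared vocabulary, which is the form both classifiers consume.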


Classification Techniques

Eager Learners:
- Naïve Bayes Classifier
- SVM (Support Vector Machine)
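For readers unfamiliar with the first technique, a Naïve Bayes text classifier might look roughly like the sketch below. The slides do not state which variant was used, so this assumes the multinomial model with add-one (Laplace) smoothing; the class name, method names, and toy training data are hypothetical.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over token lists."""

    def fit(self, token_lists, labels):
        self.class_docs = Counter(labels)          # number of documents per class
        self.term_counts = defaultdict(Counter)    # class -> term -> count
        for tokens, label in zip(token_lists, labels):
            self.term_counts[label].update(tokens)
        self.vocabulary = {t for c in self.term_counts.values() for t in c}
        self.total_docs = len(labels)
        return self

    def predict(self, tokens):
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocabulary)
            # Log prior plus summed log likelihoods; logs avoid numeric underflow.
            score = math.log(n_docs / self.total_docs)
            for t in tokens:
                if t in self.vocabulary:           # skip terms never seen in training
                    score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy training set (made up for illustration).
train_tokens = [
    ["vote", "election", "minister"],
    ["goal", "match", "team"],
    ["phone", "chip", "software"],
]
train_labels = ["politics", "sports", "technology"]
model = NaiveBayes().fit(train_tokens, train_labels)
print(model.predict(["team", "goal"]))
```

Because training only counts terms per class, the model is built in one pass over the data, which is what makes it an eager learner.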


Classification Techniques

Experiment and Evaluation
- Testing the accuracy of the classifiers (Total news articles: 237)

Algorithm                        Naïve Bayes   SVM
Correctly classified documents   217           178
Accuracy in %                    91.5 %        75.1 %


Classification Techniques (2)

Experiment and Evaluation

Politics news articles (Total news articles: 49)

Algorithm                        Naïve Bayes   SVM
Correctly classified documents   43            29
Accuracy in %                    87.7 %        59.1 %

Technology news articles (Total news articles: 86)

Algorithm                        Naïve Bayes   SVM
Correctly classified documents   72            86
Accuracy in %                    83.7 %        100.0 %

Sports news articles (Total news articles: 102)

Algorithm                        Naïve Bayes   SVM
Correctly classified documents   102           70
Accuracy in %                    100.0 %       68.6 %
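The accuracy figures are simply correctly classified documents divided by the number of test documents. The short check below recomputes them from the per-category counts; the function and variable names are illustrative. Summing the per-category counts gives 217 of 237 for Naïve Bayes, which matches the overall table, and 185 of 237 for SVM, whereas the overall table reports 178; the slides do not explain that difference.

```python
def accuracy_pct(correct, total):
    """Accuracy as a percentage: correctly classified / total documents."""
    return 100.0 * correct / total

# (correct, total) pairs copied from the per-category tables.
per_category = {
    "Naive Bayes": {"politics": (43, 49), "technology": (72, 86), "sports": (102, 102)},
    "SVM":         {"politics": (29, 49), "technology": (86, 86), "sports": (70, 102)},
}

for algorithm, results in per_category.items():
    for category, (correct, total) in results.items():
        print(f"{algorithm:12s} {category:11s} {accuracy_pct(correct, total):6.2f} %")
    # Aggregate the per-category counts into an overall figure.
    overall_correct = sum(c for c, _ in results.values())
    overall_total = sum(t for _, t in results.values())
    print(f"{algorithm:12s} {'all':11s} {accuracy_pct(overall_correct, overall_total):6.2f} %")
```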


Experiment and Evaluation

Testing SVM with only two classes? (good in some cases)

                                 Politics & Technology   Politics & Sports   Technology & Sports
Number of documents              135                     151                 188
Correctly classified documents   120                     130                 149
Accuracy in %                    88.8 %                  87.0 %              79.2 %

Execution time (in seconds)

Algorithm                              Naïve Bayes   SVM
Training phase (in seconds)            612           7
Testing phase (single text document)   1.5           <0.1
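Execution times like the ones in this table can be collected with a small wall-clock timer. The sketch below is only an assumption about how such a measurement could be made; the slides do not describe the authors' setup, and `timed` and `dummy_training` are hypothetical names, with the dummy function standing in for a real training or prediction call.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; replace with a classifier's train or predict call.
def dummy_training(n):
    return sum(i * i for i in range(n))

_, seconds = timed(dummy_training, 100_000)
print(f"dummy training took {seconds:.4f} s")

# Ratio of the training times reported in the table (612 s vs. 7 s).
speedup = 612 / 7
print(f"reported training speed-up: about {speedup:.0f}x")
```

A single run is noisy, so repeating the measurement and averaging would give steadier numbers.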


Conclusion: the findings

SVM (Support Vector Machine)
- Definitely the fastest classifier, in both training and testing (roughly 100x faster training than the Naïve Bayes classifier; 7 s vs. 612 s in this experiment)
- Works very well on large datasets
- Works better on two-class problems

Naïve Bayes Classifier
- Very accurate when the number of training instances is high enough
- Slower compared to SVM
- Larger dataset… bigger problems


Future Work

News Archive (Wayback Machine?)
- Crawl & store news from various media in Macedonia
- Store the changes in the text (find the text differences) for a given time interval
- Get the content, not just RSS
- Create screenshots
- Measure similarity (plagiarism) between news sources (cosine similarity)
- Visualize trends in news
- Use to verify the facts (Media Fact Checking Service in Macedonia)
- Financially supported by Metamorphosis Foundation & USAID (maybe)
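The cosine similarity mentioned for comparing news sources could be computed over term-count vectors, roughly as sketched below. The tokenizer is a simplifying assumption that only handles lowercase ASCII letters, so Macedonian-language text would need a Unicode-aware one; the function names and sample sentences are invented for illustration.

```python
import math
import re
from collections import Counter

def term_counts(text):
    """Bag of words for one text: term -> number of occurrences."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)

article_a = term_counts("Government announces new budget")
article_b = term_counts("New budget announced by government")
article_c = term_counts("Striker scores twice")

print(round(cosine_similarity(article_a, article_b), 3))
print(cosine_similarity(article_a, article_c))
```

Scores close to 1 indicate near-identical wording, while 0 means the two texts share no terms; where to set a plagiarism threshold is left open here.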


Questions?

THANK YOU
