

Page 1: Data Mining Presentation - Twitter Classification

Categorization of users on Twitter

CS-248 Data Mining

Muhammad Usman Riaz | ID: 101620043

Daud Khan | ID: 101620016

Muhammad Ain Ul Hassan | ID: 101620005

Muzamil Asad | ID: 101620013

Abid Javed | ID: 101620025

Spring 2014

Page 2: Data Mining Presentation - Twitter Classification

Outline

Introduction

Pre-processing

Classification

Results

Page 3: Data Mining Presentation - Twitter Classification

Problem statement

Twitter is an online social networking and microblogging service that enables users to send and read short 140-character text messages, called "tweets". Registered users can read and post tweets, but unregistered users can only read them. The objective of this project is to categorize Twitter users into classes such as company or individual, professional or home user, sportsman, student, or teacher.

Page 4: Data Mining Presentation - Twitter Classification

Dataset

Raw dataset

(a) Raw data

(b) Data organization

Page 5: Data Mining Presentation - Twitter Classification

Attributes of ‘category’

Page 6: Data Mining Presentation - Twitter Classification

Pre-processing

Conversion to ARFF format

Removal of unnecessary attributes

Tweets (strings) converted into word vectors (using the Weka "StringToWordVector" filter)

Removal of stop words (are, as, at, etc.)
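
The word-vector and stop-word steps can be illustrated outside Weka with a minimal Python sketch. This is not the StringToWordVector filter itself: the stop-word list, the tokenizer pattern, the sample tweets, and the function names `tokenize` and `to_word_vector` are all assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; Weka's own list is much longer).
STOP_WORDS = {"are", "as", "at", "a", "an", "the", "is", "to", "of"}

def tokenize(tweet):
    """Lowercase a tweet, split it into word tokens, and drop stop words."""
    words = re.findall(r"[a-z0-9#@']+", tweet.lower())
    return [w for w in words if w not in STOP_WORDS]

def to_word_vector(tweet, vocabulary):
    """Return the word counts of `tweet`, one entry per vocabulary word."""
    counts = Counter(tokenize(tweet))
    return [counts[word] for word in vocabulary]

# Hypothetical tweets, not taken from the project's dataset.
tweets = ["We are hiring at our new office", "Great match at the stadium today"]
vocabulary = sorted({w for t in tweets for w in tokenize(t)})
vectors = [to_word_vector(t, vocabulary) for t in tweets]
```

Each tweet becomes a fixed-length row of counts, which is the kind of numeric representation a classifier can consume.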

Page 7: Data Mining Presentation - Twitter Classification

Training data after pre-processing

Page 8: Data Mining Presentation - Twitter Classification

Classification

Conversion of test data to ARFF format using batch filtering.

Batch filtering is used when a second dataset, normally the test set, needs to be processed with the same statistics as the first dataset, normally the training set.
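
The idea behind batch filtering can be sketched in a few lines of Python: the attribute list is learned from the training set only and then reused unchanged for the test set, so both end up with identical columns. The function names `fit_vocabulary` and `transform` and the sample documents are hypothetical; this is not Weka's implementation.

```python
from collections import Counter

def fit_vocabulary(train_docs):
    """Build the attribute list (vocabulary) from the training set only."""
    return sorted({word for doc in train_docs for word in doc.lower().split()})

def transform(docs, vocabulary):
    """Map each document onto the fixed vocabulary; unseen words are dropped."""
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        rows.append([counts[word] for word in vocabulary])
    return rows

# Hypothetical documents, not taken from the project's dataset.
train_docs = ["new product launch", "exam week again"]
test_docs = ["product exam tomorrow"]

vocabulary = fit_vocabulary(train_docs)
train_rows = transform(train_docs, vocabulary)
test_rows = transform(test_docs, vocabulary)  # same columns as train_rows
```

The word "tomorrow" never appears in the training set, so it has no column and is ignored in the test row. Building a separate vocabulary from the test set would instead produce columns the trained classifier has never seen.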

Page 9: Data Mining Presentation - Twitter Classification

Classification

Classification using supplied test data-set

Page 10: Data Mining Presentation - Twitter Classification

Results

NaiveBayes

Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features.
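
To make the independence assumption concrete, here is a minimal sketch of a multinomial Naive Bayes text classifier with Laplace (add-one) smoothing. It is a simplified variant written for illustration, not Weka's NaiveBayes class; the function names and the toy training data are assumptions.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Collect class counts and per-class word counts from tokenized docs."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, label in zip(docs, labels):
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary

def predict(words, model):
    """Return the class with the highest log posterior."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in words:
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Independence assumption: one log-likelihood term per word.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical tokenized tweets and categories, not the project's data.
docs = [["hiring", "office", "product"], ["launch", "product", "sale"],
        ["exam", "homework", "class"], ["class", "lecture", "exam"]]
labels = ["company", "company", "student", "student"]
model = train_naive_bayes(docs, labels)
```

Calling `predict(["product", "launch"], model)` scores each category by adding the log prior to one log-likelihood term per word, which is exactly where the features are treated as independent.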

Page 11: Data Mining Presentation - Twitter Classification

Results

Classification using NaiveBayes

Page 12: Data Mining Presentation - Twitter Classification

Classifier errors using NaiveBayes

X: Category, Y: Predicted Category
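
The same actual-versus-predicted comparison can be tallied numerically. The sketch below counts (category, predicted category) pairs and computes accuracy; the labels are hypothetical and are not results from this project, and the function names are assumptions.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) category pair occurs."""
    return Counter(zip(actual, predicted))

def accuracy(actual, predicted):
    """Fraction of instances whose predicted category matches the actual one."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

# Hypothetical labels for illustration only, not the project's results.
actual = ["company", "student", "teacher", "student"]
predicted = ["company", "student", "student", "student"]
counts = confusion_counts(actual, predicted)
```

Pairs whose two entries differ, such as `("teacher", "student")`, correspond to the off-diagonal points in a classifier-errors plot.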

Page 13: Data Mining Presentation - Twitter Classification

Results

Classification using SMO

Sequential Minimal Optimization (SMO) is an algorithm for efficiently solving the optimization problem which arises during the training of support vector machines.
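
For reference, the optimization problem in question is usually written as the soft-margin SVM dual. The notation below is the standard textbook form and is not taken from the slides: $\alpha_i$ are the Lagrange multipliers, $y_i \in \{-1, +1\}$ the class labels, $K$ the kernel function, $C$ the penalty parameter, and $n$ the number of training examples.

```latex
\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
    \alpha_i \alpha_j \, y_i y_j \, K(x_i, x_j)
\quad \text{subject to} \quad
0 \le \alpha_i \le C, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0 .
```

SMO repeatedly picks two multipliers, optimizes them analytically while the others stay fixed, and stops once the optimality conditions hold within a tolerance. Two is the smallest number that can change without violating the equality constraint.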

Page 14: Data Mining Presentation - Twitter Classification

Results

Classification using SMO

Page 15: Data Mining Presentation - Twitter Classification

Conclusion

SMO is a simple algorithm with high classification accuracy for our dataset.

It performs well when the training data has a balanced class distribution.

Page 16: Data Mining Presentation - Twitter Classification

Thanks

Questions?