Team : Priya Iyer Vaidy Venkat Sonali Sharma Mentor: Andy Schlaikjer Twist : User Timeline Tweets Classifier


Page 1:

Team: Priya Iyer, Vaidy Venkat, Sonali Sharma

Mentor: Andy Schlaikjer

Twist: User Timeline Tweets Classifier

Page 2:

Goal

Automatically classify tweets on the user’s timeline into four predefined categories: Sports, Finance, Entertainment, and Technology

Input: user timeline tweets
Output: list of automatically classified tweets

Page 3:

Rationale

Twitter allows users to create custom Friend Lists based on the user handles.

Page 4:

Rationale (contd.)

Our application is a twist on this Twitter functionality: we automatically classify the tweets on the user’s timeline based only on the terms that occur in each tweet.

Page 5:

Approach

Step 1: Data collection
Step 2: Text mining
Step 3: Creation of the training file for the library
Step 4: Evaluation of several classifiers
Step 5: Selecting the best classifier
Step 6: Validating the classification
Step 7: Tuning the parameters
Step 8: Repeat until the classification is correct

Page 6:

Text Mining Process

Remove special characters
Tokenize
Remove redundant letters in words
Spell check
Stemming
Language identification
Remove stop words
Generate bigrams and change to lower case
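A minimal sketch of part of this process, using only the Python standard library. It covers special-character removal, tokenizing, squeezing redundant letters, lower-casing, stop-word removal, and bigram generation; spell check, stemming, and language identification are left out. The stop-word list, the single-character filter, and all function names are illustrative assumptions, not the team's actual code.

```python
import re

# Tiny illustrative stop-word list (assumption; a real list would be much larger).
STOP_WORDS = {"go", "such", "an", "a", "the", "is", "me"}


def remove_special_chars(text):
    """Replace anything that is not an ASCII letter, digit, or whitespace with a space."""
    return re.sub(r"[^A-Za-z0-9\s]", " ", text)


def squeeze_repeats(token):
    """Collapse runs of three or more identical letters into a single letter."""
    return re.sub(r"([A-Za-z])\1{2,}", r"\1", token)


def bigrams(tokens):
    """Pair each token with its right-hand neighbour."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def preprocess(tweet):
    """Return (tokens, bigrams) for one tweet."""
    tokens = remove_special_chars(tweet).split()
    tokens = [squeeze_repeats(t).lower() for t in tokens]
    # Dropping one-character leftovers (e.g. the "D" of ":D") is an assumption.
    tokens = [t for t in tokens if len(t) > 1 and t not in STOP_WORDS]
    return tokens, bigrams(tokens)


tokens, pairs = preprocess("Go SF Giants! Such an amaazzzing feelin!!!! :D")
print(tokens)
print(pairs)
```

Misspellings that survive this sketch would still need the spell-check and stemming steps.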

Page 7:

Original tweet: Go SF Giants! Such an amaazzzing feelin’!!!! \m/ :D
After stop word removal: SF Giants! amaazzzing feelin’!!!! \/ :D
After special character removal: SF Giants amaazzzing feelin
After spell check: SF Giants amazing feeling
After stemming: SF Giants amazing feel me
After stop word removal: SF Giants amazing feel

Page 8:

Choice of ML technique

Logistic Regression Classifier

Reasons:

Most popular linear classification technique for text classification

Ability to handle multiple categories with ease

Gave the best cross-validation accuracy and precision-recall score

Library: LIBLINEAR for Python
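To illustrate how logistic regression can handle several categories, here is a from-scratch sketch of one-vs-rest logistic regression trained with plain stochastic gradient descent. It is not LIBLINEAR's API or solver; the function names, learning rate, epoch count, and toy data are assumptions made for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_one_vs_rest(docs, labels, epochs=200, lr=0.5):
    """Train one binary logistic model per class.

    docs: list of sets of active (Boolean) feature indices
    labels: list of integer class labels, one per document
    """
    models = {}
    for c in sorted(set(labels)):
        weights, bias = {}, 0.0
        for _ in range(epochs):
            for features, label in zip(docs, labels):
                target = 1.0 if label == c else 0.0
                z = bias + sum(weights.get(f, 0.0) for f in features)
                error = sigmoid(z) - target  # gradient of the log loss w.r.t. z
                bias -= lr * error
                for f in features:
                    weights[f] = weights.get(f, 0.0) - lr * error
        models[c] = (weights, bias)
    return models


def predict(models, features):
    """Return the class whose binary model gives the highest score."""
    def score(c):
        weights, bias = models[c]
        return bias + sum(weights.get(f, 0.0) for f in features)
    return max(models, key=score)


# Toy data: four documents, one per category (1=sports ... 4=technology).
docs = [{1, 2}, {3, 4}, {5, 6}, {7, 8}]
labels = [1, 2, 3, 4]
models = train_one_vs_rest(docs, labels)
print([predict(models, d) for d in docs])
```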

Page 9:

Creation of LIBLINEAR training input

Processed tweet: SF Giants amazing feel
Indexing: SF-1 Giants-2 amazing-3 feel-4
Boolean: SF-1 (1) Giants-2 (1) amazing-3 (1) feel-4 (1)
Training input for the SVM: 1 1:1 2:1 3:1 4:1
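The indexing and Boolean steps above can be sketched as follows. The helper names are hypothetical; the sketch assumes the label comes first, followed by index:value pairs in ascending index order, as in the slide's example line.

```python
def build_vocabulary(token_lists):
    """Assign each distinct term a 1-based index in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for term in tokens:
            vocab.setdefault(term, len(vocab) + 1)
    return vocab


def to_training_line(label, tokens, vocab):
    """Format one tweet as '<label> <index>:1 ...' with Boolean feature values."""
    indices = sorted({vocab[t] for t in tokens if t in vocab})
    return " ".join([str(label)] + [f"{i}:1" for i in indices])


# One labelled tweet (1 = sports), already passed through text mining.
tweets = [(1, ["SF", "Giants", "amazing", "feel"])]
vocab = build_vocabulary(tokens for _, tokens in tweets)
lines = [to_training_line(label, tokens, vocab) for label, tokens in tweets]
print(lines[0])  # 1 1:1 2:1 3:1 4:1
```

A repeated term in a tweet still produces a single `index:1` entry, which is what makes the weighting Boolean rather than a count.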

Page 10:

Demo

Page 11:

THANK YOU

Andy, Marti & the Twitter Team

Page 12:

Questions?

Page 13:

Data Collection Challenges – Backup Slides

Collected >2000 tweets from the “Who to follow” interest lists on Twitter for “Sports” and “Business”

Tweets were not purely “Sports” or “Business” related

Personal messages were prominent

Solution: Compared each tweet against a corpus of sports- and business-related terms and assigned weights accordingly
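One way such a weighting could work is sketched below. The slides do not say how the weights were computed or used, so the term set, the scoring rule (fraction of tokens found in the term set), and the 0.3 threshold are all hypothetical.

```python
# Hypothetical sports term set; the actual corpus is not given in the slides.
SPORTS_TERMS = {"giants", "game", "score", "league"}


def category_weight(tokens, category_terms):
    """Fraction of a tweet's tokens that appear in the category term set."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t.lower() in category_terms)
    return hits / len(tokens)


tweets = [["giants", "win", "game"], ["happy", "birthday", "mom"]]
weights = [category_weight(t, SPORTS_TERMS) for t in tweets]

# Keep only tweets that look on-topic (threshold is an assumption).
kept = [t for t, w in zip(tweets, weights) if w >= 0.3]
print(weights, kept)
```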

Page 14:

Text Mining Challenges

Noise in the data:
▪ Tweets are in inconsistent format
▪ Lots of meaningless words
▪ Misspellings
▪ More of individual expression
▪ For example: BAAAAAAAAAAAASSKEttt!!!!, bskball, futball, %, :D, \m/, ^xoxo

Solution: Regular expressions and NLP toolkit

Different words, same root
Playing, plays, playful → play
Solution: Stemming
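The idea behind stemming can be shown with a toy suffix stripper. This is only a stand-in for illustration and is much cruder than Porter's algorithm; the suffix list and the minimum stem length are assumptions.

```python
# Suffixes are tried in order; the first match that leaves a long enough stem wins.
SUFFIXES = ("ing", "ful", "ed", "es", "s")


def toy_stem(word):
    """Lower-case the word and strip one known suffix, keeping at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


print([toy_stem(w) for w in ["Playing", "plays", "playful"]])  # ['play', 'play', 'play']
```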

Page 15:

Sample LIBLINEAR input format (Train)

Page 16:

LIBLINEAR output for a test file of 20 tweets

A mixed bag of sports (=1), finance (=2), entertainment (=3), and technology (=4) tweets

Comma-separated values of the category assigned to each tweet

Accuracy here is 94%. Precision: 0.89. Recall: 0.89.

Experiment with different kernels for better accuracy
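For reference, accuracy and per-category precision and recall can be computed from the true and predicted labels as sketched below. The label lists are made-up toy data, not the 20-tweet test file, and the slides do not say how precision and recall were averaged across categories.

```python
def accuracy(y_true, y_pred):
    """Share of tweets whose predicted category matches the true one."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def precision_recall(y_true, y_pred, label):
    """Precision and recall for a single category label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy labels: 1=sports, 2=finance, 3=entertainment, 4=technology.
y_true = [1, 1, 2, 2, 3, 3, 4, 4]
y_pred = [1, 1, 2, 3, 3, 3, 4, 2]

print(accuracy(y_true, y_pred))             # 0.75
print(precision_recall(y_true, y_pred, 2))  # (0.5, 0.5)
```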

Page 17:

Summary: Data Source/Software/Tools

Category-based tweets from https://twitter.com/i/#!/who_to_follow/interests
Coding done in Python
Database – sqlite3
ML tool – LIBSVM
Stemming – Porter’s Stemming
NLP Toolkit