Twist: User Timeline Tweets Classifier
Team: Priya Iyer, Vaidy Venkat, Sonali Sharma
Mentor: Andy Schlaikjer
Goal
Auto classify tweets on the user’s timeline into 4 predefined categories: Sports, Finance, Entertainment, Technology
Input: user timeline tweets
Output: list of auto-classified tweets
Rationale
Twitter allows users to create custom Friend Lists based on the user handles.
Rationale (contd.)
Our application is a twist on this functionality of Twitter where we auto classify tweets on the user’s timeline based on just the occurrence of terms in the tweet.
Approach
Step 1: Data collection
Step 2: Text mining
Step 3: Creation of the training file for the library
Step 4: Evaluation of several classifiers
Step 5: Selecting the best classifier
Step 6: Validating the classification
Step 7: Tuning the parameters
Step 8: Repeat until correct classification
Text Mining Process
Remove special characters
Tokenize
Remove redundant letters in words
Spell check
Stemming
Language identification
Remove stop words
Generate bigrams and change to lower case
Example:
Original tweet: Go SF Giants! Such an amaazzzing feelin’!!!! \m/ :D
After stop word removal: SF Giants! amaazzzing feelin’!!!! \/ :D
After special character removal: SF Giants amaazzzing feelin
After spell check: SF Giants amazing feeling
After stemming: SF Giants amazing feel me
After stop word removal: SF Giants amazing feel
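The text mining steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the team's actual code: the stop word list, the function names, and the rule of dropping single-letter leftovers (what remains of emoticons such as \m/ and :D) are assumptions, the order of steps is simplified, and spell check, stemming, and language identification are left out.

```python
import re

# Hypothetical, very small stop word list (assumption, for illustration only).
STOP_WORDS = {"go", "such", "an", "a", "the", "me", "is", "to"}


def collapse_repeats(word):
    """Squeeze runs of three or more identical characters down to one."""
    return re.sub(r"(.)\1{2,}", r"\1", word)


def clean_tweet(text):
    """Return lower-cased tokens with special characters and stop words removed."""
    # Replace everything except letters and whitespace with a space.
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    tokens = [collapse_repeats(t).lower() for t in text.split()]
    # Drop stop words and single-letter leftovers from emoticons.
    return [t for t in tokens if len(t) > 1 and t not in STOP_WORDS]


def bigrams(tokens):
    """Pair each token with its successor."""
    return list(zip(tokens, tokens[1:]))


tokens = clean_tweet(r"Go SF Giants! Such an amaazzzing feelin'!!!! \m/ :D")
print(tokens)           # ['sf', 'giants', 'amaazing', 'feelin']
print(bigrams(tokens))  # [('sf', 'giants'), ('giants', 'amaazing'), ('amaazing', 'feelin')]
```

The remaining misspellings ("amaazing", "feelin") are what the spell check and stemming steps would handle next.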
Choice of ML technique
Logistic Regression Classifier
Reasons:
Most popular linear classification technique for text classification
Ability to handle multiple categories with ease
Gave the best cross-validation accuracy and precision-recall score
Library: LIBLINEAR for Python
Creation of LIBLINEAR training input
Tokens: SF Giants amazing feel
Indexing: SF-1 Giants-2 amazing-3 feel-4
Boolean: SF-1 (1) Giants-2 (1) amazing-3 (1) feel-4 (1)
Training input for the SVM: 1 1:1 2:1 3:1 4:1
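The indexing and boolean encoding above can be sketched as follows. The function names are hypothetical and the vocabulary here is built from a single tweet; the only facts relied on are the slide's own example and that the LIBLINEAR sparse format is `<label> <index>:<value> ...` with indices in ascending order.

```python
def build_index(token_lists):
    """Give each distinct token a 1-based feature index, in order of first appearance."""
    index = {}
    for tokens in token_lists:
        for tok in tokens:
            if tok not in index:
                index[tok] = len(index) + 1
    return index


def to_liblinear_line(label, tokens, index):
    """Format one tweet as '<label> <index>:1 ...' using boolean (present = 1) features."""
    # A set removes duplicate tokens; sorting keeps indices in ascending order.
    feature_ids = sorted({index[t] for t in tokens if t in index})
    return " ".join([str(label)] + [f"{i}:1" for i in feature_ids])


tokens = ["SF", "Giants", "amazing", "feel"]
index = build_index([tokens])
print(index)                                # {'SF': 1, 'Giants': 2, 'amazing': 3, 'feel': 4}
print(to_liblinear_line(1, tokens, index))  # 1 1:1 2:1 3:1 4:1
```

With boolean features a repeated word still contributes a single `index:1` entry, which matches the slide's "occurrence of terms" framing.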
Demo
THANK YOU
Andy, Marti & the Twitter Team
Questions?
Backup Slides
Data Collection Challenges
Collected >2000 tweets from the “Who to follow” interest lists on Twitter for “Sports” and “Business”
Tweets were not purely “Sports” or “Business” related
Personal messages were prominent
Solution: Compared against a corpus of sports/business related terms and assigned weights accordingly
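One way such a comparison could work is sketched below. The slides do not show the actual term corpus or weighting scheme, so the term lists, the weight (fraction of tokens found in a category's term list), and the threshold are all assumptions made for illustration.

```python
# Hypothetical term lists; the real corpus is not shown in the slides.
CATEGORY_TERMS = {
    "sports": {"giants", "game", "score", "team", "season"},
    "business": {"stock", "market", "shares", "earnings", "revenue"},
}


def category_weights(tokens):
    """Weight per category = fraction of the tweet's tokens found in that category's terms."""
    tokens = [t.lower() for t in tokens]
    if not tokens:
        return {cat: 0.0 for cat in CATEGORY_TERMS}
    return {
        cat: sum(t in terms for t in tokens) / len(tokens)
        for cat, terms in CATEGORY_TERMS.items()
    }


def best_category(tokens, threshold=0.2):
    """Return the highest-weighted category, or None for off-topic or personal tweets."""
    weights = category_weights(tokens)
    cat, weight = max(weights.items(), key=lambda kv: kv[1])
    return cat if weight >= threshold else None


print(category_weights(["Giants", "win", "the", "game"]))  # {'sports': 0.5, 'business': 0.0}
print(best_category(["happy", "birthday", "mom"]))         # None
```

Tweets that fall below the threshold for every category could be left out of the training set, which is one way to filter the personal messages mentioned above.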
Text Mining Challenges
Noise in the data:
▪ Tweets are in inconsistent format
▪ Lots of meaningless words
▪ Misspellings
▪ More of individual expression
▪ For example: BAAAAAAAAAAAASSKEttt!!!!, bskball, futball, %, :D, \m/, ^xoxo
Solution: Regular expressions and NLP toolkit
Different words, same root
Playing, plays, playful -> play
Solution: Stemming
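The idea behind stemming can be illustrated with a toy suffix stripper. This is not Porter's algorithm, which the deck names as the stemmer used and which applies many more rules; the suffix list and the minimum stem length are assumptions chosen to cover the slide's example.

```python
# Toy suffix list (assumption); Porter's stemmer uses a much richer rule set.
SUFFIXES = ("ing", "ful", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping a stem of at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["Playing", "plays", "playful"]:
    print(w, "->", toy_stem(w))  # each prints "... -> play"
```

Mapping all three forms to one stem means they share a single feature index in the training file instead of three.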
Sample LIBLINEAR input format (Train)
LIBLINEAR output for a test file of 20 tweets
Mixed bag of sports (=1), finance (=2), entertainment (=3) and technology (=4) tweets
Comma-separated values of the categories that each tweet belongs to
Accuracy here is 94%. Precision: 0.89, Recall: 0.89
Experiment with different kernels for a better accuracy
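For reference, accuracy and per-category precision and recall can be computed from true and predicted labels as sketched below. The labels in the example are made up and do not reproduce the 20-tweet test file; the slides also do not say how precision and recall were averaged across the four categories, so this sketch reports them for one category at a time.

```python
def accuracy(y_true, y_pred):
    """Fraction of tweets whose predicted category matches the true one."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall(y_true, y_pred, label):
    """Precision and recall for a single category label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted as label
    fp = sum(t != label and p == label for t, p in pairs)  # wrongly predicted as label
    fn = sum(t == label and p != label for t, p in pairs)  # missed instances of label
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels: 1 = sports, 2 = finance, 3 = entertainment, 4 = technology.
y_true = [1, 1, 2, 2, 3, 4]
y_pred = [1, 2, 2, 2, 3, 4]
print(accuracy(y_true, y_pred))             # 5 of 6 correct
print(precision_recall(y_true, y_pred, 2))  # precision 2/3, recall 1.0
```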
Summary: Data Source/Software/Tools
Category-based tweets from https://twitter.com/i/#!/who_to_follow/interests
Coding done in Python
Database: sqlite3
ML tool: LIBSVM
Stemming: Porter’s Stemming
NLP Tool kit