Twist: User Timeline Tweets Classifier

Team: Priya Iyer, Vaidy Venkat, Sonali Sharma
Mentor: Andy Schlaikjer


Goal

Automatically classify tweets on the user’s timeline into four predefined categories: Sports, Finance, Entertainment, and Technology

Input: user timeline tweets
Output: list of auto-classified tweets


Rationale

Twitter allows users to create custom Friend Lists based on the user handles.


Rationale (contd.)

Our application is a twist on this functionality of Twitter where we auto classify tweets on the user’s timeline based on just the occurrence of terms in the tweet.


Approach

Step 1: Data collection
Step 2: Text mining
Step 3: Creation of the training file for the library
Step 4: Evaluation of several classifiers
Step 5: Selecting the best classifier
Step 6: Validating the classification
Step 7: Tuning the parameters
Step 8: Repeat until the classification is correct


Text Mining Process

▪ Remove special characters
▪ Tokenize
▪ Remove redundant letters in words
▪ Spell check
▪ Stemming
▪ Language identification
▪ Remove stop words
▪ Generate bigrams and change to lower case


Original tweet: Go SF Giants! Such an amaazzzing feelin’!!!! \m/ :D

After stopword removal: SF Giants! amaazzzing feelin’!!!! \/ :D

After special-character removal: SF Giants amaazzzing feelin

After spell check: SF Giants amazing feeling

After stemming: SF Giants amazing feel me

After second stopword removal: SF Giants amazing feel
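A minimal Python sketch of part of this pipeline (special-character removal, collapsing repeated letters, lower-casing, stop-word removal, bigram generation). The stop-word list, the rule that collapses runs of three or more identical letters to two, the dropping of one-letter tokens, and the `preprocess` name are illustrative assumptions, not the team's code. Spell check, stemming, and language identification are omitted, so misspelled tokens survive.

```python
import re

# Tiny illustrative stop-word list (an assumption; the slides do not give the real one).
STOP_WORDS = {"go", "such", "an", "a", "the", "me", "is", "of"}


def preprocess(tweet):
    """Return (unigrams, bigrams) for one tweet."""
    # Remove special characters: anything that is not a letter or whitespace.
    text = re.sub(r"[^A-Za-z\s]", " ", tweet)
    # Remove redundant letters: runs of 3+ identical letters become 2.
    text = re.sub(r"([A-Za-z])\1{2,}", r"\1\1", text)
    # Tokenize on whitespace and change to lower case.
    tokens = [t.lower() for t in text.split()]
    # Drop stop words and one-letter leftovers from emoticons such as ":D".
    tokens = [t for t in tokens if len(t) > 1 and t not in STOP_WORDS]
    # Generate bigrams from adjacent tokens.
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return tokens, bigrams


tokens, bigrams = preprocess(r"Go SF Giants! Such an amaazzzing feelin’!!!! \m/ :D")
print(tokens)   # ['sf', 'giants', 'amaazzing', 'feelin']
print(bigrams)  # ['sf giants', 'giants amaazzing', 'amaazzing feelin']
```

A spell checker and a stemmer would then map `amaazzing` and `feelin` to the forms shown on the slide.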


Choice of ML technique

Logistic Regression Classifier

Reasons:
▪ One of the most popular linear classification techniques for text classification
▪ Ability to handle multiple categories with ease
▪ Gave the best cross-validation accuracy and precision-recall score

Library: LIBLINEAR for Python
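The cross-validation comparison mentioned above can be sketched in plain Python. This is only an illustration under stated assumptions: the fold layout (contiguous, unshuffled), the `fit` callback convention, and the majority-class baseline are hypothetical stand-ins, not the team's setup or the LIBLINEAR API.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, roughly equal folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def cross_val_accuracy(fit, X, y, k=5):
    """Average held-out accuracy; fit(X_train, y_train) must return a predict function."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(X[i]) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)


def fit_majority(X_train, y_train):
    """Hypothetical baseline: always predict the most frequent training label."""
    majority = max(set(y_train), key=y_train.count)
    return lambda x: majority


X = [[0]] * 9                      # placeholder features; the baseline ignores them
y = [1, 1, 1, 1, 1, 1, 1, 2, 2]    # made-up labels for illustration only
print(cross_val_accuracy(fit_majority, X, y, k=3))
```

Each candidate classifier would be wrapped in its own `fit` function and the one with the highest average score kept. In practice the samples should be shuffled before splitting.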


Creation of LIBLINEAR training input

Preprocessed tweet: SF Giants amazing feel

Indexing: SF - 1, Giants - 2, amazing - 3, feel - 4

Boolean: SF-1 (1), Giants-2 (1), amazing-3 (1), feel-4 (1)

Training input for the SVM: 1 1:1 2:1 3:1 4:1
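The indexing and Boolean steps above can be sketched as follows. The helper names `build_vocabulary` and `to_liblinear_line` are hypothetical; the output follows the sparse `<label> <index>:<value>` text format that LIBLINEAR reads, with feature indices in ascending order.

```python
def build_vocabulary(tokenized_tweets):
    """Assign each distinct term a 1-based index in order of first appearance."""
    vocab = {}
    for tokens in tokenized_tweets:
        for term in tokens:
            if term not in vocab:
                vocab[term] = len(vocab) + 1
    return vocab


def to_liblinear_line(label, tokens, vocab):
    """Format one tweet as '<label> <index>:1 ...' using Boolean term presence."""
    # A set removes duplicate terms; sorting gives the ascending index order.
    indices = sorted({vocab[t] for t in tokens if t in vocab})
    features = " ".join(f"{i}:1" for i in indices)
    return f"{label} {features}".strip()


tweets = [["SF", "Giants", "amazing", "feel"]]
vocab = build_vocabulary(tweets)
print(to_liblinear_line(1, tweets[0], vocab))  # 1 1:1 2:1 3:1 4:1
```

Terms that are missing from the vocabulary are skipped here, which is one simple way to handle unseen words at test time.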


Demo


THANK YOU

Andy, Marti & The Twitter Team


Questions?


Data Collection Challenges (Backup Slides)

Collected >2000 tweets from the “Who to follow” interest lists on Twitter for “Sports” and “Business”

Tweets were not purely “Sports” or “Business” related

Personal messages were prominent

Solution: Compared against a corpus of sports/business related terms and assigned weights accordingly
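One way such a weighting could look, as a hedged sketch only: the slides do not say how the weights were computed, so the term list, the `category_weight` function, and the fraction-of-matching-tokens score are all assumptions made for illustration.

```python
# Hypothetical sports lexicon; the team's actual corpus of terms is not given.
SPORTS_TERMS = {"giants", "game", "score", "team", "season"}


def category_weight(tokens, lexicon):
    """Fraction of a tweet's tokens that appear in the category lexicon."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t.lower() in lexicon)
    return hits / len(tokens)


on_topic = category_weight(["Giants", "win", "the", "game"], SPORTS_TERMS)
personal = category_weight(["happy", "birthday", "mom"], SPORTS_TERMS)
print(on_topic, personal)  # 0.5 0.0
```

Tweets with a low weight, such as personal messages, could then be down-weighted or left out of the training set.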


Text Mining Challenges

Noise in the data:
▪ Tweets are in inconsistent format
▪ Lots of meaningless words
▪ Misspellings
▪ More of individual expression
▪ For example, BAAAAAAAAAAAASSKEttt!!!!, bskball, futball, %, :D, \m/, ^xoxo

Solution: Regular expressions and NLP toolkit

Different words, same root
Playing, plays, playful - play
Solution: Stemming


Sample LIBLINEAR input format (Train)


LIBLINEAR output for a test file of 20 tweets

Mixed bag of sports (=1), finance (=2), entertainment (=3) and technology (=4) tweets

Comma-separated values of the categories that each tweet belongs to

Accuracy here is 94%. Precision: 0.89. Recall: 0.89.

Experiment with different kernels for better accuracy
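For reference, accuracy, precision, and recall for a four-category output like this can be computed as sketched below. The slides do not state how precision and recall were averaged across categories, so macro-averaging is an assumption here, and the label lists are made up for illustration rather than taken from the team's test file.

```python
def evaluate(y_true, y_pred):
    """Return (accuracy, macro-precision, macro-recall) for multi-class labels."""
    labels = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precisions, recalls = [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        predicted = sum(p == c for p in y_pred)   # tweets assigned to category c
        actual = sum(t == c for t in y_true)      # tweets truly in category c
        precisions.append(tp / predicted if predicted else 0.0)
        recalls.append(tp / actual if actual else 0.0)
    return accuracy, sum(precisions) / len(labels), sum(recalls) / len(labels)


# Made-up labels: 1 = sports, 2 = finance, 3 = entertainment, 4 = technology.
y_true = [1, 1, 2, 2, 3, 3, 4, 4]
y_pred = [1, 1, 2, 3, 3, 3, 4, 4]
accuracy, precision, recall = evaluate(y_true, y_pred)
print(accuracy, round(precision, 3), recall)  # 0.875 0.917 0.875
```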


Summary: Data Source/Software/Tools

▪ Category-based tweets from https://twitter.com/i/#!/who_to_follow/interests
▪ Coding done in Python
▪ Database: sqlite3
▪ ML tool: LIBSVM
▪ Stemming: Porter’s Stemming
▪ NLP toolkit