Twist: User Timeline Tweets Classifier - University of ... priya.iyer/twist/Twist Final...Twist: User…

  • View
    213

  • Download
    1

Embed Size (px)

Transcript

  • Twist:

    User Timeline Tweets Classifier By

    Sonali Sharma

    Priya Iyer

    Vaidyanath Venkat

    Project Report

    Report presented towards the completion of the class project for INFO 290 Analyzing Big Data

    with Twitter

    Date: 12/10/2012.

    Mentor:

    Andy Schlaikjer

    Instructor

    Marti Hearst

  • 1. Abstract

    Social media plays a big role in our day-to day lives. Over the last few years, micro-blogging services like

    Twitter have become a great source of information from friends, celebrities, organizations and a means for

    building social networks. These services are being increasingly used for real-time information sharing,

    news and recommendations.

    This report describes the creation, implementation and evaluation of a classifier, trained using supervised

    machine learning techniques, which takes tweets as input, and classifies them into a set of predefined

    categories. These categories are Sports, Finance, Technology and Entertainment. In this report, we discuss

    the goals of this project, the necessary data collected to train and test the classifier. We further evaluate

    the options used; provide details of implementation and evaluation techniques used for this classifier. The

    output of the classifier is shown with the help of a wen interface.

    2. Project Goals

    Currently twitter allows users to create custom lists based on user profiles. This is a personalized list

    where the users classify Twitter profiles in one or more categories. E.g. @newyorktimes, @cnn,

    @guardian etc. are usually classified under News List. Similarly, one would classify @SportsCenter,

    @TwitterSports, @NBCSports, @SkySportsNews under Sports list. Generally, these lists are based on the

    twitter handles and not on the text of the tweet per se. The goal of our application is to:

    a) Analyze the tweets, process them to remove redundant information and then employ machine

    learning techniques to automatically classify the tweets under one or more predefined categories

    based on one or more features of the tweets.

    b) Create a web interface for the users to login using their twitter account and view the tweets from their

    home timeline classified under on these categories of Sports, Technology, Finance, Entertainment and

    Others.

    3. Project Strategy

    We divided the project into phases. The tasks under each of the phases were divided amongst the team

    members based on preferences and skill sets. The next step was to do a literature review to analyze

    previous work and to identify the best practices. We then divided our project into logical phases.

    Phase 1:

    Create an outline of the application, identify software needs, plan the application architecture

  • Data Collection, data consisted of tweets from handles under Sports, Technology, Finance and

    Entertainment categories on the Who to Follow page:

    https://twitter.com/i/#!/who_to_follow/interests

    Text Mining: Creation of custom algorithms to mine the text from the tweets and remove noise

    Creation of training and test files in the classifier library specific format

    Creation of algorithms to train and test the classifier with metrics for evaluation

    Phase 2:

    Measure effectiveness of the classifier using precision, recall and cross validation.

    Refine the classifier using more training set and features.

    Create the web interface running on a Flask server using Python script for users to view classified

    tweets

    Code documentation and review

    Preparation of final project report and presentation.

    4. Implementation

    4.1 Overview

    To create a classifier we collected over 100,000 tweets belonging to sports, technology, finance and

    entertainment. The implementation steps were divided into backend (data collection, processing of

    tweets and classifier module) and frontend (to show classified tweets).

    4.2 Technologies Used

    Tweepy

    Tweepy is a Python library for accessing the Twitter API. We implemented the authorization and

    streaming modules of the live tweets using this package.

    Twitter 4j API V.1.0

    The Twitter REST API methods allow developers to access core Twitter data. This includes update

    timelines, status data, and user information. For the purpose of data collection we used Twitter 4j

    API version 1.0, implemented in Java, to collect tweets from a list of users.

    Python 2.7.3

  • Preprocessing of the tweets, which consisted of cleaning the tweets by removal of special

    characters, spell checks, stemming, removal of stop words, tokenization, bigrams was done in

    python 2.7.3. Spell checking and stop-words removal was implemented using the NLTK toolkit.

    LIBLINEAR

    The library used for creating different classifier models was LIBLINEAR for Python.

    4.3 Data Collection

    In order to train the classifier we collected over 25000 tweets from each of the four categories

    using Twitter4j REST API V.1.1. To ensure that relevant tweets are collected, we got a list of the top

    15-20 most influential users for each category from Twitters recommended Who to Follow list

    (https://twitter.com/i/#!/who_to_follow/interests) as shown in Figure 1.

    Figure 1. Who To Follow suggestion lists on Twitter

    We collected over 2000 tweets per user under each category into four separate text files. List of

    selected users for each category is described in Appendix A.

    https://twitter.com/i/%23!/who_to_follow/interests

  • Note: In order to address the problem of rate limiting, tweets were collected in regular intervals as

    opposed to a one-time run. Paging was used which allows 200 tweets to be collected per page from

    each user.

    4.4 Text Mining

    Before using the Twitter data as training set, it was very important to pre-process the data to

    extract meaningful information. Twitter data consists of lots noise, which must be removed.

    Tweets are in an inconsistent format, they contain a lot of special characters, abbreviations, @ for

    mentions, # for tags, misspelt words, URLs and exclamatory words. We identified the following

    rules and built algorithms for each of these to clean and preprocess data in order to create a

    relevant training set.

    Rule#1: Take only English language words.

    Rule#2: Remove all special characters except hashtags and URLs

    Rule#3: Correct words containing repeated letters e.g. sooooo good, changed to so good.

    Rule#4: Apply spell check on words using Norwicks spell check algorithm

    Rule#5: Remove stop words

    Rule#6: Apply stemming to change the word to its root. Porters stemming algorithm was use for

    this purpose.

    Rule#7: Convert words to lower case.

    Rule#8: Tokenize words using the whitespace and create bigrams.

    We wrote methods in Python to execute the rules defined above and create a clean dataset for

    training and testing purposes.

    This is illustrated in Figure 2.

  • Figure 2: An illustration of the flow of Text Processing Module

    4.5 Classifier

    4.5.1 Rationale

    For the purposes of text classification, we made use of multiclass linear classifiers. From the

    literature review, it was clear that in particular, the most common technique in practice has

    been to build one-versus-rest classifiers (commonly referred to as one-versus-all'' or OVA

    classification), and to choose the class which classifies the test datum with greatest margin.

    Thus, our classifier had to classify into: Sports, Finance, Entertainment, Technology and Others.

    For this purpose, we implemented four One vs. Rest classifiers (commonly referred to as

    one-versus-all or OVA classification), one for each category.

    Each tweet is passed through each of these classifiers. The output of each of these binary

    classifiers would be whether the tweet belongs to the category that classifier is trained for. Thus,

    after the tweet has passed through all the four classifiers, we will know the categories tweet

    belongs to. If it doesnt belong to any of the categories, it will be classified under

    Others.

    We used the LIBLINEAR library for python to implement linear classifiers.

    LIBLINEAR is a linear classifier for data with millions of instances and features.

  • It supports

    L2-regularized classifier

    L2-loss linear SVM, L1-loss linear SVM, and logistic regression (LR)

    L1-regularized classifiers

    L2-loss linear SVM and logistic regression (LR)

    L2-regularized support vector regression

    L2-loss linear SVR and L1-loss linear SVR.

    4.5.2 Choice of the Classifier

    The choice of the classifier was crucial. From the literature review, it was clear to us that the

    two best machine learning techniques for text classification would be Logistic Regression and

    SVM (Support Vector Machines). Once we narrowed down to the use of the library, the next

    challenge for us was to determine the exact classifier that would be the most effective in

    classifying tweets. We conducted an experiment to choose the best classifier type which is

    explained in detail in the section 4.5.3.

    4.5.3 Preparing the training set

    From the text processing module as discussed in Section 4.4, we created four different

    training sets for each of the four classifiers. We created a custom algorithm to convert each of

    the tweets from tex