
courseprojects.souravsengupta.comcourseprojects.souravsengupta.com/wp-content/... · Web viewSpeech analysis-Using twitter to generate a word cloud and from that determine a ... these


Movie Recommendation System Using Twitter Data

Recommending Movies to a User Based on His Twitter Handle

“This is your life and it’s sending one minute at a time”


Twitter is one of the most popular social networking websites in the world. As of the third quarter of 2015, it had around 307 million active users. On average, around 6,000 tweets are posted every second, which corresponds to over 350,000 tweets per minute, 500 million tweets per day and around 200 billion tweets per year. This large volume of data can be, and has been, used to provide a number of services, such as:

• Speech analysis: using Twitter to generate a word cloud and, from that, determine a person’s style of speech. It can also be used to find a person’s topics of interest.

• Success of a product or campaign: analyzing the popularity and the reviews of a product.

• Detection of earthquakes.

• Prediction of social unrest and mass protests.

The project was based on understanding the semantics of the tweets made by the user and then recommending movies to the user on the basis of words used.

Methodology

The project was divided into four parts. The first two, extracting Twitter data and building genre dictionaries, ran in tandem. They were followed by building the movie list from which to recommend movies and, finally, developing the mapping algorithm.

Extracting Twitter Data

Tweets from a particular Twitter handle are captured using the userTimeline function. Captured tweets often contain text that is not proper English, so the text is cleaned and common words such as helping verbs and prepositions are removed.

We then created a term-document matrix of all the words and kept only the words that appear at least three times. These words were reduced to their stems, the counts of words sharing a common stem were added together, and each stem was assigned a weight.
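The cleaning, frequency filter and stemming steps can be sketched as follows. This is a minimal illustration in Python, not the project's R code (which used twitteR, tm and SnowballC). The stop-word list, the `crude_stem` helper and the decision to drop URLs and @mentions are all assumptions made for the example; a real stemmer would replace `crude_stem`.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; the project's list is not given.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "were", "be", "to", "of",
              "in", "on", "at", "for", "and", "or", "it", "this", "that", "i"}


def crude_stem(word):
    """Toy suffix stripper standing in for a real Snowball stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def frequent_stems(tweets, min_count=3):
    """Return {stem: count} built from words that occur at least min_count times."""
    word_counts = Counter()
    for tweet in tweets:
        # Assumed cleaning: lowercase, then drop URLs and @mentions.
        text = re.sub(r"https?://\S+|@\w+", " ", tweet.lower())
        for word in re.findall(r"[a-z]+", text):
            if word not in STOP_WORDS:
                word_counts[word] += 1

    stem_counts = Counter()
    for word, count in word_counts.items():
        if count >= min_count:                       # keep words seen at least thrice
            stem_counts[crude_stem(word)] += count   # add counts of words sharing a stem
    return dict(stem_counts)
```

Filtering before stemming follows the order described in the text; stemming first would let rare variants of a common stem survive the filter, which is a different design choice.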

Weights were assigned using the TF-IDF method. Term Frequency (TF) captures the raw frequency of a word (the count of the word in a document), while Inverse Document Frequency (IDF) captures the relative importance of the word in the document. The final weight of a word is the product of its TF score and its IDF score.


Building genre dictionaries

A genre dictionary is a list of words that could represent a particular genre of movies. To build the genre dictionaries, we collected data from four different sources. The first two were movie descriptions and movie plots from IMDB and Wikipedia. The third was the key/tag words for every movie from themoviedatabase.com (tmdb.com). The fourth was manual intervention: a list of words that could represent the genre was added to every genre by hand.

Action        Animation  Comedy
mutant        infant     singles
boxer         anime      student
army          disney     romance
hero          magic      comedy
marvel        pixar      college
mission       castle     amnesia
pirate        walt       date
hero          cartoon    firemen
bond          joy        lie
confidential  child      speed

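[Figure: building the genre dictionaries. Words from 250 movie descriptions (e.g. mutant, boxer, marvel, mission, revenge, hero, bond), tag words for every movie (e.g. disney, date) and manually selected related words are mapped to the specific genres they represent. The text is then cleaned, frequent words are kept and words are reduced to their stems.]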

A weight was then assigned to every word in the same manner as for the Twitter data.

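$$\mathrm{weight}_{t_{w,d}} = TF \times IDF$$

$$TF = \begin{cases} 1 + \log_{10} t_{w,d}, & t_{w,d} > 0 \\ 0, & \text{otherwise} \end{cases}$$

$$IDF = \log_{10}\left(\frac{\text{total words in the document}}{t_{w,d}}\right), \quad t_{w,d} > 0$$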
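As a worked illustration of these formulas, the following Python sketch computes the weight of every term from its raw count. It assumes that the count of a word in the document is what the formulas call t_{w,d}, and that "total words in the document" means the sum of all term counts after cleaning; the function name `tf_idf_weights` is hypothetical.

```python
import math


def tf_idf_weights(term_counts):
    """Weight each term as TF * IDF, following the formulas above.

    term_counts maps a term to its raw count in the document.
    """
    # Assumption: "total words in the document" is the sum of all term counts.
    total_words = sum(term_counts.values())
    weights = {}
    for term, count in term_counts.items():
        if count <= 0:
            weights[term] = 0.0          # TF is defined as 0 when the term is absent
            continue
        tf = 1 + math.log10(count)
        idf = math.log10(total_words / count)
        weights[term] = tf * idf
    return weights
```

For example, in a document of 100 words where "hero" occurs 10 times, TF = 1 + log10(10) = 2 and IDF = log10(100 / 10) = 1, so the weight is 2. Note that this IDF is computed within a single document, as the formulas state, rather than across a collection of documents.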

Building Movie List

A list of around 1,800 movies was compiled. Every movie could fall into 2 to 4 of the 11 major genres selected. A sample of the list, with all the genres, is shown below. A binary representation indicates whether a movie falls into a particular genre (1) or not (0).

ID  Name                     Action  Animation  Mystery  Comedy  Drama  Family  Horror  Romance  Thriller  Adventure  SciFi
1   State of the World       0       0          0        0       1      0       0       0        0         0          0
2   Garage                   0       0          0        0       1      0       0       0        0         0          0
3   Frank                    0       0          1        1       1      0       0       0        0         0          0
4   Mission: Impossible III  1       0          0        0       0      0       0       0        1         1          0
5   Star Trek                1       0          0        0       0      0       0       0        0         1          1
6   Rana's Wedding           0       0          0        1       1      0       0       0        0         0          0
7   Paradise Now             0       0          0        0       1      0       0       0        0         0          0
8   Omar                     0       0          0        0       1      0       0       0        1         0          0
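The binary representation can be produced from a list of genres per movie. The Python sketch below is one way to do it, under the assumption that each movie starts out as a set of genre names; `genre_row` is a hypothetical helper, and the two movies are taken from the sample list.

```python
GENRES = ["Action", "Animation", "Mystery", "Comedy", "Drama", "Family",
          "Horror", "Romance", "Thriller", "Adventure", "SciFi"]


def genre_row(movie_genres):
    """Return the 11-element 0/1 row for a movie's set of genres."""
    unknown = set(movie_genres) - set(GENRES)
    if unknown:
        raise ValueError(f"unknown genres: {sorted(unknown)}")
    return [1 if genre in movie_genres else 0 for genre in GENRES]


# Two rows from the sample movie list.
movies = {
    "Mission: Impossible III": {"Action", "Thriller", "Adventure"},
    "Star Trek": {"Action", "Adventure", "SciFi"},
}
movie_matrix = {name: genre_row(genres) for name, genres in movies.items()}
```

Keeping the genre order fixed in `GENRES` matters, because the same order must be used for the user's genre-score vector when the two are multiplied.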

Finding User’s preference of genre

Every word in the user dictionary was matched against every word in each of the 11 genre dictionaries. For every word that matched, the corresponding weights were multiplied. This gave us a score for every genre.


11 dictionaries, one for each genre (Action, Animation, Mystery, Comedy, Drama, Family, Horror, Romance, Thriller, Adventure, SciFi). Sample from the Action dictionary:

Action        Weight
mutant        0.002355
boxer         0.023459
army          0.123588
hero          0.007896
marvel        0.003698
mission       0.087352
pirate        0.314879
hero          0.361255
bond          0.098763
confidential  0.198756

List of words from the user’s Twitter handle:

UserTweets  Weight
hero        0.002589
marvel      0.238937
mission     0.025734
comedy      0.006662
college     0.114809
joy         0.302830
child       0.217025
date        0.472366
day         0.098997
fight       0.198521

Finding the user’s genre preference by generating a score for every genre:

Genre      User's Score
action     31.2888101
animation  31.4013252
mystery    31.3814179
comedy     31.4627622
drama      30.9160581
family     31.0038619
horror     30.6097828
romance    31.2182833
thriller   31.0338976
adventure  30.7779819
scifi      30.8852927
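The matching step can be sketched in Python as below. The report says the weights of matching words were multiplied but does not say how the products were combined into one genre score, so summing them is an assumption of this sketch, and it will not reproduce the scores reported above. The function name `genre_score` is hypothetical; the sample weights are copied from the Action dictionary and the user's word list.

```python
def genre_score(user_weights, genre_weights):
    """Sum of user_weight * genre_weight over every matching word pair.

    Both arguments are lists of (word, weight) pairs, so a word listed twice
    in a dictionary (like "hero" in the Action sample) contributes twice.
    """
    score = 0.0
    for user_word, user_weight in user_weights:
        for genre_word, genre_weight in genre_weights:
            if user_word == genre_word:
                score += user_weight * genre_weight
    return score


# Sample weights from the Action dictionary and the user's Twitter word list.
action = [("mutant", 0.002355), ("boxer", 0.023459), ("army", 0.123588),
          ("hero", 0.007896), ("marvel", 0.003698), ("mission", 0.087352),
          ("pirate", 0.314879), ("hero", 0.361255), ("bond", 0.098763),
          ("confidential", 0.198756)]
user = [("hero", 0.002589), ("marvel", 0.238937), ("mission", 0.025734),
        ("comedy", 0.006662), ("college", 0.114809), ("joy", 0.302830),
        ("child", 0.217025), ("date", 0.472366), ("day", 0.098997),
        ("fight", 0.198521)]

# Only "hero", "marvel" and "mission" appear in both lists.
action_score = genre_score(user, action)
```

Running the same function against each of the 11 genre dictionaries would give one score per genre, which is the vector used by the mapping algorithm.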

Mapping Algorithm

1. Take the dot product of the movie matrix (1814 × 11) with the score vector (11 × 1).

2. Get the movie score vector (1814 × 1), which holds a score for every movie.

3. Recommend the movies at the top of the list.

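$$
\underbrace{
\begin{array}{l}
\text{Death Race} \\ \vdots \\ \text{OldBoy} \\ \vdots \\ \text{KungFu Panda} \\ \vdots \\ \text{Blood Diamond} \\ \vdots \\ \text{Beowulf}
\end{array}
\begin{bmatrix}
1 & 0 & 1 & 0 & \cdots & 1 \\
\vdots & & & & & \vdots \\
0 & 1 & 1 & 0 & \cdots & 1 \\
\vdots & & & & & \vdots \\
1 & 0 & 1 & 1 & \cdots & 0 \\
\vdots & & & & & \vdots \\
0 & 0 & 1 & 1 & \cdots & 1 \\
\vdots & & & & & \vdots \\
1 & 0 & 1 & 1 & \cdots & 0
\end{bmatrix}
}_{1814 \times 11}
\cdot
\underbrace{
\begin{array}{l}
\text{Action} \\ \vdots \\ \text{Comedy} \\ \vdots \\ \text{SciFi}
\end{array}
\begin{bmatrix}
31.288 \\ \vdots \\ 31.462 \\ \vdots \\ 30.885
\end{bmatrix}
}_{11 \times 1}
$$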

Name                       Score
Manchurian Candidate, The  95.00495
Bourne Supremacy, The      95.00495
Paprika                    95.00495
Scanner Darkly, A          91.59114
Inception                  91.59114
Oldboy                     91.59114
Watchmen                   89.54713
Death Race                 89.54713
Lucy                       89.54713
Titan A.E.                 88.17734
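The three steps of the mapping algorithm amount to a matrix-vector product followed by a sort. The Python sketch below shows this on three movies from the sample movie list, using the genre scores from the example user; `rank_movies` is a hypothetical name, and the small dictionary stands in for the full 1814 × 11 matrix.

```python
def rank_movies(movie_matrix, genre_scores):
    """Score each movie as the dot product of its 0/1 genre row with the
    user's genre-score vector, then sort from highest to lowest score."""
    scored = []
    for name, row in movie_matrix.items():
        if len(row) != len(genre_scores):
            raise ValueError(f"{name}: expected {len(genre_scores)} genre flags")
        scored.append((name, sum(flag * s for flag, s in zip(row, genre_scores))))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Genre order: Action, Animation, Mystery, Comedy, Drama, Family,
# Horror, Romance, Thriller, Adventure, SciFi.
genre_scores = [31.2888101, 31.4013252, 31.3814179, 31.4627622, 30.9160581,
                31.0038619, 30.6097828, 31.2182833, 31.0338976, 30.7779819,
                30.8852927]
movie_matrix = {
    "Garage":                  [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    "Frank":                   [0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "Mission: Impossible III": [1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0],
}
ranking = rank_movies(movie_matrix, genre_scores)
```

Because every row is binary, a movie's score is simply the sum of the user's scores for the genres it belongs to. One consequence is that movies tagged with more genres tend to score higher, which may explain the ties among the top results.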

Outcome and Next Steps

The project's output could be used to characterize a user's personality in terms of his likes and dislikes.

Since a user's choice of movies could be associated with many other things, such as books or places he would like to visit, the approach could be built upon further.

It can serve as a tool for online marketing based on a user's social media profile. Some of the improvements that could be made are as follows:


• Improving semantics and associations of words in the user profile
    o Different words can carry the same meaning. Moreover, when the term-document matrix is created, different forms of the same word, even singular and plural, are treated as different words. We tried to minimize this by using the stem of the word instead of the word itself, which improved the accuracy to some extent. The next step would be to find the intensity of a word, group words with the same meaning together, and then determine the user's personality.

• Increasing the number of genres
    o Given the scope of the project and the limited time, we considered only 11 genres, which broadly classify most movies. Netflix uses around 77,000 micro-genres to classify movies. The more genres there are, the better the classification and recommendation will become.

• Finding associations among movies
    o So far we have tried to find associations among various aspects of a user's profile. The same can be done for movies. Each movie could be tagged with a few key words and other characteristics such as time of release, cast, actor-director pair, geography, etc. These can then be used to cluster movies and improve the recommendation process.

• Including other aspects of the Twitter profile
    o In the project as it stands, we have considered only the tweets made by the user. These could be complemented by other aspects such as hashtags, the handles a person follows, retweets, etc. This would let us gather more information about the user.

Challenges

• Extracting legible data from Twitter
    o Ironic as it may sound, social media are a storehouse of data, but only a small fraction of that data can be used to extract useful information.
    o Moreover, the use of slang and of incorrect or misspelled English words makes it difficult to use the extracted data.

• Building the genre dictionaries
    o Many movies fall into several categories, so a word that represents one genre could also fall into other genres. Estimating the weights of such words so as to distinguish their presence in different genres was a challenge. This was handled by manually adding a set of words to every genre.

• Lack of training and test data
    o This problem, being exclusive to the project, was the most difficult to handle. Since there is no directly related literature available, it was difficult to find out whether our algorithm was working well. We took a small sample of users and asked for their movie preferences. We then generated recommendations from their Twitter handles and compared these with the stated preferences. We found significant overlap between the two lists.
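The comparison between recommended movies and stated preferences can be quantified in several ways. The report does not say which measure was used, so the Python sketch below is only one plausible choice: the share of recommended titles that also appear in the user's own list. The function name `overlap_fraction` and the case-insensitive title matching are assumptions.

```python
def overlap_fraction(recommended, preferred):
    """Share of recommended titles that also appear in the user's stated preferences."""
    recommended_set = {title.strip().lower() for title in recommended}
    preferred_set = {title.strip().lower() for title in preferred}
    if not recommended_set:
        return 0.0
    return len(recommended_set & preferred_set) / len(recommended_set)
```

For instance, if four movies are recommended and two of them appear in the user's stated preferences, the function returns 0.5. With a sample of users, averaging this value across users would give a single rough indicator of how well the recommendations match.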

R Packages Used

• twitteR
• SnowballC
• sqldf
• tm


R Functions Used

• userTimeline
• twListToDF
• regmatches
• TermDocumentMatrix
• tm_map
• Corpus
• wordStem
• sqldf
• setup_twitter_oauth

Thank You

PGDBA Group 08

Bharathi R, Faichali Basumatary, Shoorvir Gupta, Vishal Bagla