Data Mining and Recommendation Systems
- SALIL NAVGIRE
Introduction
• Data mining is the discovery of models for data
• Example: if the data is a set of numbers, we can assume it comes from a Gaussian distribution and estimate the parameters (mean and standard deviation) that define it completely
• Recognizing meaningful patterns in data -> data mining
• Predicting outcomes from known patterns -> ML
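
The Gaussian example above can be sketched in a few lines. This is a minimal illustration, assuming the maximum-likelihood estimates (sample mean and population standard deviation); the function names and the data are made up for the example.

```python
import math
import statistics

def fit_gaussian(samples):
    """Estimate the two parameters (mean, standard deviation) that define a Gaussian."""
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)  # population (maximum-likelihood) estimate
    return mu, sigma

def gaussian_pdf(x, mu, sigma):
    """Density of the fitted model at x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# Hypothetical data set of numbers
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu, sigma = fit_gaussian(data)
print(mu, sigma)  # 5.0 2.0
```

Once the two parameters are known, the model can score any new value with `gaussian_pdf`.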
Data Mining Techniques
• Classification
  • Predicting the class of a new item, given a set of classes and past instances labeled with those classes
  • Example: loan approval based on decision tree classifiers
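[Figure: decision tree for loan approval. The root node splits on Job (Carpenter, Engineer, Doctor); each branch then splits on Income, and the leaves are labeled Bad or Good. Income thresholds shown in the figure: <30K, <40K, <50K, >50K, >90K, >100K.]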
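
A tree of this shape can be written as a two-level lookup: first branch on job, then compare income against a per-job threshold. The sketch below is only an illustration. The thresholds are hypothetical and are not taken from the figure, and a real decision tree classifier would learn its splits from past loan data rather than have them written by hand.

```python
# Hypothetical per-job income thresholds (assumed for illustration only;
# they are NOT the values from the slide's figure).
HYPOTHETICAL_THRESHOLDS = {
    "Carpenter": 30_000,
    "Engineer": 50_000,
    "Doctor": 90_000,
}

def classify_applicant(job, income):
    """Walk the two-level tree: split on job first, then on income."""
    if job not in HYPOTHETICAL_THRESHOLDS:
        raise ValueError(f"no branch for job {job!r}")
    threshold = HYPOTHETICAL_THRESHOLDS[job]
    return "Good" if income >= threshold else "Bad"

print(classify_applicant("Carpenter", 35_000))  # Good
print(classify_applicant("Doctor", 35_000))     # Bad
```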
• Clustering
  • Clustering algorithms find groups of items that are similar
  • A clustering divides a dataset so that records with similar content are in the same group and the groups are as different as possible from each other
  • K-Nearest Neighbor: a classification method that classifies a point based on its distances to the points in the training dataset
  • Example: car sales
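
A minimal sketch of K-Nearest Neighbor, assuming Euclidean distance and a majority vote among the k closest training points. The car-sales data (age in years, price in thousands) and the segment labels are hypothetical.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """train is a list of (features, label) pairs; query is a feature tuple."""
    # Sort training points by Euclidean distance to the query and keep the k closest.
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Majority vote among the labels of the k nearest neighbours.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical car-sales records: (age in years, price in thousands) -> segment
train = [
    ((1, 30), "premium"), ((2, 28), "premium"), ((3, 25), "premium"),
    ((8, 6), "budget"), ((9, 5), "budget"), ((10, 4), "budget"),
]
print(knn_classify(train, (2, 27)))  # premium
print(knn_classify(train, (9, 6)))   # budget
```

In practice the features would be scaled first so that no single attribute dominates the distance.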
• Regression
  • Deals with predicting a value rather than a class
  • Given x1, x2, x3, ..., predict Y
  • Use linear regression to estimate the coefficients a0, a1, a2, ... in Y = a0 + a1x1 + a2x2 + ...
  • Use line fitting and curve fitting methods
  • Example: find a relationship between smoking and cancer-related illness in patients
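
For a single input variable, the coefficients in Y = a0 + a1x1 have a closed-form least-squares solution. The sketch below assumes ordinary least squares and uses made-up data; with several input variables one would normally use a linear algebra library instead.

```python
def fit_line(xs, ys):
    """Ordinary least squares for Y = a0 + a1 * x with one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    a0 = mean_y - a1 * mean_x
    return a0, a1

# Hypothetical data lying exactly on Y = 1 + 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
a0, a1 = fit_line(xs, ys)
print(a0, a1)  # 1.0 2.0
```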
• Association Rules
  • These algorithms create rules that describe how often events have occurred together
  • Example: when a customer buys a hammer, 90% of the time they also buy nails
  • Spam classification based on conditional probability
  • Support measures what fraction of the population satisfies both the antecedent and the consequent of the rule
  • Confidence measures how often the consequent is true when the antecedent is true
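
Support and confidence follow directly from the definitions above. The sketch below computes both for the rule "hammer -> nails" over a small, hypothetical list of shopping baskets; the function names are illustrative.

```python
def support(transactions, antecedent, consequent):
    """Fraction of all transactions containing both antecedent and consequent."""
    both = antecedent | consequent
    return sum(1 for t in transactions if both <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing the antecedent that also contain the consequent."""
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        return 0.0
    return sum(1 for t in with_antecedent if consequent <= t) / len(with_antecedent)

# Hypothetical shopping baskets
transactions = [
    {"hammer", "nails"},
    {"hammer", "nails", "saw"},
    {"hammer"},
    {"saw"},
    {"nails", "glue"},
]
rule_support = support(transactions, {"hammer"}, {"nails"})
rule_confidence = confidence(transactions, {"hammer"}, {"nails"})
print(rule_support)  # 0.4
```

Here two of the five baskets contain both items (support 0.4), and two of the three baskets with a hammer also contain nails (confidence 2/3).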
• Outlier Analysis
  • Most data mining methods discard outliers as noise or exceptions
  • However, in some applications such as fraud detection, these rare events can be more interesting than the regular ones
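
One simple way to surface such rare events is a z-score test: flag values that lie more than a chosen number of standard deviations from the mean. This is only one of many outlier detection techniques and is used here as an assumption; the threshold and the transaction amounts are hypothetical.

```python
import statistics

def find_outliers(values, threshold=2.0):
    """Return the values whose z-score exceeds the threshold in absolute value."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical card transaction amounts with one suspicious payment
amounts = [20, 22, 19, 21, 23, 20, 500]
print(find_outliers(amounts))  # [500]
```

With small samples a single extreme value inflates the standard deviation, so more robust measures (for example, based on the median) are often preferred.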
Knowledge Discovery Process
• Data Collection
• Data Cleaning
• Data Integration
• Data Selection
• Data Transformation
• Data Mining
• Evaluation
• Knowledge Presentation
Applications of Data Mining
• Marketing
  • Analysis of consumer behavior
  • Advertising campaigns
  • Targeted mailings
  • Segmentation of customers, stores, or products
• Finance
  • Creditworthiness of clients
  • Performance analysis of financial investments
  • Fraud detection
• Manufacturing
  • Optimization of resources
  • Optimization of manufacturing processes
  • Product design based on customer requirements
• Health Care
  • Discovering patterns in X-ray images
  • Analyzing side effects of drugs
  • Effectiveness of treatments
Privacy Concerns
• Effective data mining requires large sources of data
• To achieve a wide spectrum of data, multiple data sources are linked
• Linking sources can be problematic for privacy. For example, suppose the following histories of a customer were linked:
  • Shopping history
  • Credit history
  • Bank history
  • Employment history
• The user's life story could then be pieced together from the collected data
Recommendation Systems
• Definition: recommendation systems (RS) are a subclass of information filtering systems that seek to predict the rating or preference that a user would give to an item
• Enhance the user experience by helping users find information and reducing search and navigation time
• Increase productivity and credibility
• Address the long-tail phenomenon by surfacing less popular items
• Types of RS
  • Content based RS
  • Collaborative filtering RS
  • Hybrid RS
• Content based RS
  • Recommend items similar to those the user preferred in the past
  • User profiling is the key
  • Items/content are usually described by keywords
  • Limitations
    • Not all content is well represented by keywords (e.g. images)
    • Unrated items are not shown
    • Users with thousands of purchases are a problem
  • Example: Pandora uses properties of a song from the Music Genome Project to play similar songs
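
A minimal sketch of the content based idea: build a keyword profile from the items a user liked, then rank the remaining items by keyword overlap with that profile. Jaccard similarity is an assumption chosen for simplicity, and the song names and keywords are hypothetical; this is not a description of how Pandora works.

```python
def jaccard(a, b):
    """Overlap between two keyword sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend_content_based(items, liked, top_n=2):
    """items maps item name -> keyword set; liked is the list of items the user preferred."""
    # User profile: all keywords of the items the user liked in the past.
    profile = set().union(*(items[name] for name in liked))
    scores = {
        name: jaccard(profile, keywords)
        for name, keywords in items.items()
        if name not in liked
    }
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda name: (-scores[name], name))[:top_n]

# Hypothetical catalogue described by keywords
items = {
    "song_a": {"rock", "guitar", "fast"},
    "song_b": {"rock", "guitar", "slow"},
    "song_c": {"jazz", "piano", "slow"},
    "song_d": {"rock", "drums", "fast"},
}
print(recommend_content_based(items, liked=["song_a"]))  # ['song_b', 'song_d']
```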
• Collaborative Filtering method
  • Uses other users' ratings to make recommendations
  • The key is to find users or user groups whose interests match those of the current user
  • More users and more ratings give better results
  • Limitations
    • Cold start problem
    • Large computation power required
    • Sparsity
  • Example: Last.fm and Spotify recommend songs based on a user's listening history and a comparison with other users. Facebook and LinkedIn use collaborative filtering to recommend new friends and connections
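
A minimal sketch of user-based collaborative filtering: the predicted rating is a similarity-weighted average of the ratings other users gave to the item. Cosine similarity over co-rated items is an assumption made for this example, and the users and ratings are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two users, computed over the items both have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def predict_rating(ratings, user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    weighted_sum = 0.0
    similarity_sum = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine(ratings[user], other_ratings)
        weighted_sum += sim * other_ratings[item]
        similarity_sum += sim
    # No similar user has rated the item: no prediction is possible (cold start).
    return weighted_sum / similarity_sum if similarity_sum else None

# Hypothetical ratings on a 1-5 scale
ratings = {
    "alice": {"a": 5, "b": 4},
    "bob": {"a": 5, "b": 4, "c": 5},
    "carol": {"a": 1, "b": 5, "c": 1},
}
prediction = predict_rating(ratings, "alice", "c")
```

Bob's tastes match Alice's more closely than Carol's do, so the prediction for item "c" is pulled toward Bob's rating. An item that nobody has rated yields `None`, which is the cold start problem in miniature.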
• Hybrid RS
  • In some cases, combining content based and collaborative filtering is more effective than either alone
  • Can overcome the sparsity and cold start problems
  • Netflix Prize: Netflix offered a prize of $1 million to the team that could improve the accuracy of its rating predictions by 10%. The competition ran from 2006 to 2009 and was won by BellKor's Pragmatic Chaos, who used an ensemble of 107 algorithms for a single prediction!
• Amazon item-to-item collaborative filtering
  • Compute the similarity between item pairs
  • Combine the similar items into a recommendation list
  • Each item corresponds to a vector whose dimensions correspond to the customers who have purchased it
  • The similar-items table is built offline
  • Measuring similarity
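
The item-to-item steps can be sketched as follows, assuming cosine similarity between binary purchase vectors (a common choice; the slide does not show its formula). Each item is represented by the set of customers who bought it, so the cosine reduces to the number of shared buyers divided by the square root of the product of the two buyer counts. The purchase data is hypothetical.

```python
import math
from collections import defaultdict

def build_similar_items(purchases):
    """Offline step: build a table of item -> [(similarity, other item), ...]."""
    # Item vectors: item -> set of customers who purchased it.
    buyers = defaultdict(set)
    for customer, items in purchases.items():
        for item in items:
            buyers[item].add(customer)
    table = {}
    for a in buyers:
        sims = []
        for b in buyers:
            if a == b:
                continue
            overlap = len(buyers[a] & buyers[b])
            if overlap:
                # Cosine similarity for binary vectors.
                sim = overlap / math.sqrt(len(buyers[a]) * len(buyers[b]))
                sims.append((sim, b))
        table[a] = sorted(sims, key=lambda pair: (-pair[0], pair[1]))
    return table

def recommend(table, owned, top_n=3):
    """Online step: combine the similar items of everything the customer owns."""
    scores = defaultdict(float)
    for item in owned:
        for sim, other in table.get(item, []):
            if other not in owned:
                scores[other] += sim
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]

# Hypothetical purchase histories: customer -> items bought
purchases = {
    "c1": {"hammer", "nails"},
    "c2": {"hammer", "nails", "saw"},
    "c3": {"hammer", "nails"},
    "c4": {"saw", "glue"},
}
table = build_similar_items(purchases)
print(recommend(table, {"hammer"}))  # ['nails', 'saw']
```

Because the table is computed ahead of time, the online step only needs lookups and a sort.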
Examples
• E-Commerce: Amazon.com, eBay, Etsy
• Music: Spotify, Pandora
• Movies: Netflix.com, IMDb
• News: Digg, Summly
• Social Networks: LinkedIn, Facebook, Quora, YouTube
• Apps: Play Store, Cover