WordNet-based User Profiles forWordNet-based User Profiles forSemantic PersonalizationSemantic Personalization
LACAM – Knowledge Acquisition and Machine Learning LabDepartment of Computer Science – University of Bari
Giovanni Semeraro, Marco Degemmis, Pasquale Lops, Ignazio Palmisano
July 24th, 2005, Edinburgh, Scotland, UK
PIA 2005 – Workshop on New Technologies for Personalized Information Access
Final RemarksExperimentsWordNet-based profilesPersonalized AccessIntroduction
OutlineOutline
✔ Introduction
✔ Personalized Access
✔ WordNet-based profiles
✔ Experiments✔ Final Remarks
Final RemarksPersonalized AccessIntroduction
Final RemarksExperimentsWordNet-based profilesPersonalized AccessIntroduction
Today’s Information SocietyToday’s Information Society
Problems…
Explosion of irrelevant information Users overloaded by this information
…and consequences
Searching is time consumingNeed for intelligent solutions to support users
Final RemarksPersonalized AccessIntroduction
Final RemarksExperimentsWordNet-based profilesPersonalized AccessIntroduction
My…Web?My…Web? 1/21/2
Personalized Portals
& Stores
Introduction
Final RemarksExperimentsWordNet-based profilesPersonalized AccessIntroduction
Preferences
Arts & PhotographyChildren’s books
Computers & Internet
Book description at Amazon.com
content-based recommendations by learning from TEXT and users’ ratings on items
Learning User Profiles as a Text Categorization ProblemLearning User Profiles as a Text Categorization Problem
Personalized Access
Final RemarksExperimentsWordNet-based profilesPersonalized AccessIntroduction
Research Goal Research Goal
USER PROFILE: A STRUCTURED REPRESENTATION OF USER INTERESTS AND PREFERENCES
Intelligent Information Access =
3. Personalized Access by user profiles +4. Semantic Access by concept identification in
documents
Automated induction of user profiles by means of supervised machine learning techniques
Taking into account the meaning of the words
Personalized Access
Final RemarksExperimentsWordNet-based profilesPersonalized AccessIntroduction
Training documentsTraining documents
Intelligent Personalized Information AccessIntelligent Personalized Information Access
matching
Learning
Positiveexamples
Negativeexamples
Concept-basedProfile
Conceptidentification
Documents
Concept-based document
representation
Personalized Access
Final RemarksExperimentsWordNet-based profilesPersonalized AccessIntroduction
Movie Recommending on the WebMovie Recommending on the Web
Instance
(movie)Cast
1
User Ratings:0-5
Director
Keywords
Summary
Title
Tokenization + Stopword elimination + Stemming
Bag of Words (BOW)
Personalized Access
Final RemarksExperimentsWordNet-based profilesPersonalized AccessIntroduction
Word Sense Disambiguation (WSD)Word Sense Disambiguation (WSD)
Many meanings for polisemous words, known as senses
One sense at a time is used in a specific context.
Deciding which sense to use is Word Sense Disambiguation
Approaches to WSD
Knowledge-based: uses Machine Readable Dictionaries
Corpus-based: uses sense-tagged corpus
WordNet-based profiles
Final RemarksExperimentsWordNet-based profilesPersonalized AccessIntroduction
WordNetWordNet Lexical reference database whose design is inspired by current
psycholinguistic theories of human lexical memory The work started in 1985 by a group of psychologists and
linguists at Princeton University English nouns, verbs, adverbs and adjectives are organized into
SYNonym SETs, each representing one underlying lexical concept
Relations among synsets can be used to engineer a change of representation in text data by transforming vectors of words into vectors of word meanings The synonymy relation can be used to map words with similar
meanings together Hypernymy (corresponding to the IS–A relation) can be used
to generalize noun and verb meanings to a higher level of abstraction
http://www.cogsci.princeton.edu/~wn/
WordNet-based profiles
Final RemarksExperimentsWordNet-based profilesPersonalized AccessIntroduction
How to use WordNet for WSD?
Semantic similarity between synsets is inversely proportional to the distance between synsets in the WordNet IS-A hierarchyPath length similarity between synsets could be used to assign scores to synsets of a polysemous word
Synset Semantic SimilaritySynset Semantic Similarity
Placental mammal
Carnivore Rodent
Feline, felid
Cat(feline mammal)
Mouse(rodent)
1
2
3 4
5
SINSIM(cat,mouse) =-log(5/32)=0.806
Leacock-Chodorow[Fellbaum 1998]
WordNet-based profiles
Final RemarksExperimentsWordNet-based profilesPersonalized AccessIntroduction
Semantic IndexingSemantic IndexingA document d is mapped into a list of WordNet synsets following these
steps: Each monosemous word w in a slot of a document d is mapped into the
corresponding WordNet synset; For each couple of words <noun,noun> or <adjective,noun>, a search in
WordNet is made in order to verify if at least one synset exists for the bigram <w1,w2>. In the positive case, WSD algorithm is applied on the bigram, otherwise it is applied separately on w1 and w2, using all words in the slot as the context C of w;
Each polysemous unigram w is disambiguated, using all words in the slot as the context C of w.
WordNet-based profiles
Final RemarksExperimentsWordNet-based profilesPersonalized AccessIntroduction
Experimental EvaluationExperimental EvaluationExtended Eachmovie
Internet Movie Database10-fold stratified cross-validation
Precision, Recall, F-measure, NDPMMovie relevant if rating >2
Rocchio: Cosine Similarity (positive/negative profile)Experiments: BOW-generated profiles vs. BOS-generated
profiles Wilcoxon signed rank test Low number of independent trials Classification for each genre is a trial Significance level p < 0.05
Experiments
Final RemarksExperimentsWordNet-based profilesPersonalized AccessIntroduction
The EachMovie DatasetThe EachMovie Dataset
Project conducted by Compaq Research Centre (1996-1997) Dataset of user-movie ratings
About 2.8 millions ratings Over 72,000 users 1,628 items (movies) subdivided in 10 categories Discrete rating between 0 and 5 Movies content crawled from the Internet Movie Database (IMDb)
10 movie categories 933 randomly selected users 100 users for each category, only for Category 2 – Animation, 33 users
selected Each user rated between 30 and 100 movies
Experiments
Final RemarksExperimentsWordNet-based profilesPersonalized AccessIntroduction
Extended Eachmovie (ratings+content)Extended Eachmovie (ratings+content)
28.0628.0671.9471.943,7093,709ThrillerThriller1010
99
88
77
66
55
44
33
22
11
Id GenreId Genre
71.8471.84
72.97 72.97
59.89 59.89
63.71 63.71
76.24 76.24
63.46 63.46
91.73 91.73
76.21 76.21
56.67 56.67
72.05 72.05
%POS%POS
39,29839,298
3,7073,707
3,6313,631
3,8083,808
4,8804,880
4,7144,714
5,0265,026
4,2464,246
1,1031,103
4,4744,474
#Rated Movies#Rated Movies
28.1628.16
27.03 27.03 RomanceRomance
40.11 40.11 HorrorHorror
36.29 36.29 FamilyFamily
23.76 23.76 DramaDrama
36.54 36.54 ComedyComedy
8.27 8.27 ClassicClassic
23.79 23.79 Art_ForeignArt_Foreign
43.33 43.33 AnimationAnimation
27.95 27.95 ActionAction
%NEG%NEGGenreGenre
75-100% positive70-75% positive60-65% positive
Experiments
Final RemarksExperimentsWordNet-based profilesPersonalized AccessIntroduction
Bag of WordsBag of Words Bag of SynsetsBag of Synsets
11611161
……
11341134
11341134
……6767
3131
Id Id MovieMovie
11
……
22
33
……11
11
OccurrenceOccurrence
zoomzoom
……
wheelwheel
rollroll
……murdermurder
aaronaaron
Word FormWord Form
Bag of SynsetsBag of Synsets
16185511618551
20517202051720
20517202051720
67125686712568
88440218844021
Id SynsetId Synset
11611161
……
11341134
11341134
……6767
3131
Id Id MovieMovie
11
……
22
33
……11
11
OccurrenceOccurrence
zoomzoom
……
wheelwheel
RollRoll
……murdermurder
aaronaaron
Word FormWord Form
5520517202051720rollroll11341134
#Features=172.296 #Features=107.990
38% Reduction of features representing movies in the EachMovie dataset Mainly on slots containing proper names
Recognition of bigrams Synonyms represented by the same synsets
Experiments
Final RemarksExperimentsWordNet-based profilesPersonalized AccessIntroduction
Semantic Profiles EvaluationSemantic Profiles Evaluation
0.44 0.44
0.450.45
0.48 0.48
0.42 0.42
0.41 0.41
0.45 0.45
0.44 0.44
0.45 0.45
0.46 0.46
0.34 0.34
0.46 0.46
BOWBOW
NDPMNDPM
0.44 0.44
0.440.44
0.48 0.48
0.44 0.44
0.40 0.40
0.45 0.45
0.46 0.46
0.43 0.43
0.48 0.48
0.38 0.38
0.44 0.44
BOSBOS
0.73 0.73 0.67 0.67 0.82 0.82 0.74 0.74 0.69 0.69 0.64 0.64 88
0.77 0.77 0.69 0.69 0.84 0.84 0.75 0.75 0.74 0.74 0.68 0.68 77
0.77 0.77 0.74 0.74 0.81 0.81 0.79 0.79 0.76 0.76 0.73 0.73 99
0.84 0.84
0.780.78
0.81 0.81
0.70 0.70
0.94 0.94
0.84 0.84
0.63 0.63
0.79 0.79
BOSBOS
0.74 0.74
0.770.77
0.80 0.80
0.67 0.67
0.93 0.93
0.77 0.77
0.64 0.64
0.75 0.75
BOWBOW
F1F1
0.83 0.83 0.78 0.78 0.760.76 0.73 0.73 MeMeanan
0.840.840.850.850.750.750.740.741010
0.87 0.87 0.84 0.84 0.79 0.79 0.78 0.78 66
0.75 0.75 0.72 0.72 0.69 0.69 0.66 0.66 55
0.96 0.96 0.94 0.94 0.94 0.94 0.92 0.92 44
0.86 0.86 0.79 0.79 0.85 0.85 0.77 0.77 33
0.66 0.66 0.66 0.66 0.64 0.64 0.65 0.65 22
0.86 0.86 0.82 0.82 0.75 0.75 0.72 0.72 11
BOSBOSBOWBOWBOSBOSBOWBOW
RecallRecallPrecisionPrecisionIdId
Experiments
Final RemarksExperimentsWordNet-based profilesPersonalized AccessIntroduction
ResultsResults
Improvement in precision (+3%) and recall (+5%) The BOS model outperforms the BOW model specifically on datasets:
3 (+8% of precision, +7% of recall) 7 (+6% of precision, +9% of recall) 8 (+5% of precision, +8% of recall)
No improvement on dataset 2 (Animation) Low number of rated movies WSD errors (difficulty in disambiguating stories)
Experiments
Final RemarksExperimentsWordNet-based profilesPersonalized AccessIntroduction
Conclusions & Future WorksConclusions & Future Works Extending BOW to BOS improves classification accuracy when WSD in
performed on short documents Improved results are independent from the distribution of positive and
negative examples in the dataset
Integration of user profiles into UUCM [Metha et al. 2005] Ontologies and user profiles
Domain-specific ontologies
Final Remarks