
WordNet-based User Profiles for Semantic Personalization

LACAM – Knowledge Acquisition and Machine Learning Lab
Department of Computer Science – University of Bari

Giovanni Semeraro, Marco Degemmis, Pasquale Lops, Ignazio Palmisano

July 24th, 2005, Edinburgh, Scotland, UK

PIA 2005 – Workshop on New Technologies for Personalized Information Access


Outline

✔ Introduction

✔ Personalized Access

✔ WordNet-based profiles

✔ Experiments

✔ Final Remarks



Today’s Information Society

Problems…

Explosion of irrelevant information

Users overloaded by this information

…and consequences

Searching is time consuming

Need for intelligent solutions to support users



My…Web? 1/2

Personalized Portals & Stores



Preferences

Arts & Photography

Children’s books

Computers & Internet

Book description at Amazon.com

content-based recommendations by learning from TEXT and users’ ratings on items

Learning User Profiles as a Text Categorization Problem



Research Goal

USER PROFILE: A STRUCTURED REPRESENTATION OF USER INTERESTS AND PREFERENCES

Intelligent Information Access =

3. Personalized Access by user profiles +
4. Semantic Access by concept identification in documents

Automated induction of user profiles by means of supervised machine learning techniques

Taking into account the meaning of the words



Intelligent Personalized Information Access

[Diagram; labels: Training documents (Positive examples, Negative examples), Concept identification, Learning, Concept-based Profile, Documents, Concept-based document representation, matching]



Movie Recommending on the Web

Instance (movie)
Slots: Title, Director, Cast, Summary, Keywords
User Ratings: 0-5

Tokenization + Stopword elimination + Stemming

Bag of Words (BOW)
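The preprocessing chain named on this slide can be sketched as follows. This is only an illustration: the stopword list is a tiny sample, `crude_stem` is a hypothetical stand-in for a real stemmer (the slide does not say which one was used), and keeping one bag per slot is an assumption based on the slot structure of a movie instance.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption; not the list used by the authors).
STOPWORDS = {"a", "an", "the", "of", "and", "in", "to", "is"}

def crude_stem(token):
    """Rough suffix stripping; a hypothetical stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def bag_of_words(text):
    """Tokenize, drop stopwords, stem, and count the remaining word forms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(crude_stem(t) for t in tokens if t not in STOPWORDS)

def index_movie(slots):
    """Build one bag of words per slot (title, director, cast, ...)."""
    return {slot: bag_of_words(text) for slot, text in slots.items()}

movie = {"title": "The Rolling Wheels", "summary": "The wheels rolled and the wheel rolls"}
bags = index_movie(movie)
```

Keeping the slots apart means a word in the title and the same word in the summary stay distinct features, which is one possible reading of the slot-based instance shown here.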



Word Sense Disambiguation (WSD)

Polysemous words have many meanings, known as senses

Only one sense is used in a specific context

Word Sense Disambiguation is the task of deciding which sense is in use

Approaches to WSD

Knowledge-based: uses Machine Readable Dictionaries

Corpus-based: uses sense-tagged corpus



WordNet

Lexical reference database whose design is inspired by current psycholinguistic theories of human lexical memory

Work started in 1985 by a group of psychologists and linguists at Princeton University

English nouns, verbs, adverbs and adjectives are organized into SYNonym SETs (synsets), each representing one underlying lexical concept

Relations among synsets can be used to engineer a change of representation in text data by transforming vectors of words into vectors of word meanings

The synonymy relation can be used to map words with similar meanings together

Hypernymy (corresponding to the IS-A relation) can be used to generalize noun and verb meanings to a higher level of abstraction

http://www.cogsci.princeton.edu/~wn/
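A minimal sketch of the change of representation described above, using a hand-written toy lexicon rather than WordNet itself. The synset labels, the `SYNSET_OF` and `HYPERNYM_OF` tables, and the assumption that every word has exactly one sense are all illustrative.

```python
from collections import Counter

# Toy lexicon (assumption): each word maps to a single synset label.
SYNSET_OF = {
    "car": "car.n.01",
    "automobile": "car.n.01",
    "auto": "car.n.01",
    "truck": "truck.n.01",
}

# Toy IS-A links (assumption): synset -> hypernym synset.
HYPERNYM_OF = {
    "car.n.01": "motor_vehicle.n.01",
    "truck.n.01": "motor_vehicle.n.01",
}

def to_synsets(words):
    """Synonymy: words with the same meaning collapse onto one synset feature."""
    return Counter(SYNSET_OF[w] for w in words if w in SYNSET_OF)

def generalize(bag):
    """Hypernymy: lift each synset one level up the IS-A hierarchy when possible."""
    lifted = Counter()
    for synset, count in bag.items():
        lifted[HYPERNYM_OF.get(synset, synset)] += count
    return lifted

bag = to_synsets(["car", "automobile", "truck"])
abstract_bag = generalize(bag)
```

In this toy run the three word features shrink to two synset features, and to a single feature after one generalization step.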



Synset Semantic Similarity

How to use WordNet for WSD?

Semantic similarity between synsets is inversely proportional to the distance between synsets in the WordNet IS-A hierarchy

Path length similarity between synsets could be used to assign scores to the synsets of a polysemous word

[Figure: fragment of the WordNet IS-A hierarchy. Placental mammal is the parent of Carnivore and Rodent; Feline, felid is below Carnivore and Cat (feline mammal) is below Feline; Mouse (rodent) is below Rodent. The path between Cat and Mouse is numbered 1 to 5.]

SINSIM(cat, mouse) = -log(5/32) = 0.806

Leacock-Chodorow [Fellbaum 1998]
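The cat/mouse figure can be reproduced with a small sketch. It assumes that the path length counts the five links between Cat and Mouse, that the denominator 32 is twice a maximum taxonomy depth of 16, and that the logarithm is base 10 (the value 0.806 only comes out with base 10). The `PARENT` table is a hand-copied fragment of the hierarchy, not a WordNet lookup.

```python
import math

# Fragment of the IS-A hierarchy from the slide: child -> parent.
PARENT = {
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "placental_mammal",
    "mouse": "rodent",
    "rodent": "placental_mammal",
}
MAX_DEPTH = 16  # assumption: 2 * 16 gives the 32 used on the slide

def ancestors(node):
    """Return the chain from a node up to the root of the fragment."""
    chain = [node]
    while node in PARENT:
        node = PARENT[node]
        chain.append(node)
    return chain

def path_length(a, b):
    """Number of links on the shortest IS-A path between two distinct nodes."""
    steps_from_b = {n: i for i, n in enumerate(ancestors(b))}
    for steps_from_a, n in enumerate(ancestors(a)):
        if n in steps_from_b:
            return steps_from_a + steps_from_b[n]
    raise ValueError("no common ancestor in this fragment")

def sinsim(a, b, max_depth=MAX_DEPTH):
    """Leacock-Chodorow style score; a and b must differ so the path is non-zero."""
    return -math.log10(path_length(a, b) / (2 * max_depth))

score = sinsim("cat", "mouse")
```

Shorter paths give larger scores, so among the candidate synsets of a polysemous word the one closest to the context synsets would rank highest.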



Semantic Indexing

A document d is mapped into a list of WordNet synsets following these steps:

1. Each monosemous word w in a slot of a document d is mapped into the corresponding WordNet synset;
2. For each pair of words <noun,noun> or <adjective,noun>, WordNet is searched to verify whether at least one synset exists for the bigram <w1,w2>. If so, the WSD algorithm is applied to the bigram; otherwise it is applied separately to w1 and w2, using all words in the slot as the context C of w;
3. Each polysemous unigram w is disambiguated, using all words in the slot as the context C of w.
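The three steps can be sketched over a toy lexicon. Several things here are assumptions for illustration: the lexicon and its sense labels, the similarity function `toy_sim`, the scoring rule in `disambiguate` (sum of the best similarity to each context word), skipping words that are not in the lexicon, and trying every adjacent pair as a bigram instead of checking part-of-speech patterns.

```python
from typing import Callable, Dict, List

Lexicon = Dict[str, List[str]]            # term -> candidate synset labels
Similarity = Callable[[str, str], float]  # similarity between two synset labels

def disambiguate(term: str, context: List[str], lexicon: Lexicon, sim: Similarity) -> str:
    """Pick the sense of `term` that is most similar to the senses of the context words."""
    best, best_score = None, float("-inf")
    for sense in lexicon[term]:
        score = 0.0
        for other in context:
            other_senses = lexicon.get(other, [])
            if other == term or not other_senses:
                continue
            score += max(sim(sense, s) for s in other_senses)
        if score > best_score:
            best, best_score = sense, score
    return best

def index_slot(words: List[str], lexicon: Lexicon, sim: Similarity) -> List[str]:
    """Map the words of one slot to synsets, preferring known bigrams."""
    synsets, i = [], 0
    while i < len(words):
        bigram = " ".join(words[i:i + 2])
        if i + 1 < len(words) and bigram in lexicon:
            term, step = bigram, 2
        else:
            term, step = words[i], 1
        senses = lexicon.get(term, [])
        if len(senses) == 1:      # monosemous: direct mapping
            synsets.append(senses[0])
        elif len(senses) > 1:     # polysemous: the whole slot is the context
            synsets.append(disambiguate(term, words, lexicon, sim))
        i += step
    return synsets

TOY_LEXICON = {
    "cat": ["cat.animal"],
    "mouse": ["mouse.animal", "mouse.device"],
    "hot dog": ["hot_dog.food"],
    "dog": ["dog.animal"],
}

def toy_sim(a: str, b: str) -> float:
    """Toy similarity: 1.0 when two labels share the domain after the dot."""
    return 1.0 if a.split(".")[1] == b.split(".")[1] else 0.0

animals = index_slot(["cat", "chases", "mouse"], TOY_LEXICON, toy_sim)
snack = index_slot(["hot", "dog", "stand"], TOY_LEXICON, toy_sim)
```

In a fuller version the similarity argument would be a path-based measure over the IS-A hierarchy, and the bigram test would be restricted to the <noun,noun> and <adjective,noun> patterns named in step 2.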



Experimental Evaluation

Extended EachMovie
Internet Movie Database
10-fold stratified cross-validation
Precision, Recall, F-measure, NDPM
Movie relevant if rating > 2
Rocchio: Cosine Similarity (positive/negative profile)
Experiments: BOW-generated profiles vs. BOS-generated profiles
Wilcoxon signed rank test
  Low number of independent trials
  Classification for each genre is a trial
  Significance level p < 0.05
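A simplified sketch of a Rocchio-style profile with a positive and a negative prototype compared by cosine similarity. The slide does not give the weighting scheme, so plain centroids of raw counts are assumed here; the feature names and the decision rule in `likes` are also illustrative.

```python
import math
from collections import Counter

def centroid(bags):
    """Average a list of feature-count bags into one prototype vector."""
    total = Counter()
    for bag in bags:
        total.update(bag)
    return {feature: count / len(bags) for feature, count in total.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(feature, 0.0) for feature, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def train(rated_bags, threshold=2):
    """Split training items by the 'relevant if rating > 2' rule and build both prototypes."""
    positive = [bag for bag, rating in rated_bags if rating > threshold]
    negative = [bag for bag, rating in rated_bags if rating <= threshold]
    return centroid(positive), centroid(negative)

def likes(profile, bag):
    """Predict 'relevant' when the item is closer to the positive prototype."""
    positive, negative = profile
    return cosine(bag, positive) > cosine(bag, negative)

training = [
    (Counter({"murder": 2, "detective": 1}), 5),
    (Counter({"cartoon": 3}), 1),
]
profile = train(training)
```

The same code works for BOW and BOS profiles, since only the feature keys change (word forms in one case, synset identifiers in the other).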



The EachMovie Dataset

Project conducted by Compaq Research Centre (1996-1997)
Dataset of user-movie ratings
About 2.8 million ratings
Over 72,000 users
1,628 items (movies) divided into 10 categories
Discrete rating between 0 and 5
Movie content crawled from the Internet Movie Database (IMDb)
10 movie categories
933 randomly selected users
100 users for each category, except Category 2 (Animation), for which 33 users were selected
Each user rated between 30 and 100 movies



Extended EachMovie (ratings+content)

Id   Genre         #Rated Movies   %POS    %NEG
1    Action         4,474          72.05   27.95
2    Animation      1,103          56.67   43.33
3    Art_Foreign    4,246          76.21   23.79
4    Classic        5,026          91.73    8.27
5    Comedy         4,714          63.46   36.54
6    Drama          4,880          76.24   23.76
7    Family         3,808          63.71   36.29
8    Horror         3,631          59.89   40.11
9    Romance        3,707          72.97   27.03
10   Thriller       3,709          71.94   28.06
     Total         39,298          71.84   28.16

Legend: 75-100% positive, 70-75% positive, 60-65% positive



Bag of Words vs. Bag of Synsets

Bag of Words

Id Movie   Word Form   Occurrence
31         aaron       1
67         murder      1
…          …           …
1134       roll        3
1134       wheel       2
…          …           …
1161       zoom        1

#Features = 172,296

Bag of Synsets

Id Movie   Word Form   Id Synset   Occurrence
31         aaron       8844021     1
67         murder      6712568     1
…          …                       …
1134       roll        2051720     3
1134       wheel       2051720     2
…          …                       …
1161       zoom        1618551     1

In movie 1134, roll and wheel share synset 2051720 and merge into a single feature with occurrence 5

#Features = 107,990

38% reduction of features representing movies in the EachMovie dataset
Mainly on slots containing proper names
Recognition of bigrams
Synonyms represented by the same synsets
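The merge of roll and wheel can be illustrated with the rows shown on this slide. The pairing of word forms with synset identifiers is read off the flattened table, so it should be treated as a best-effort reading; `to_bos` is a hypothetical helper name.

```python
from collections import Counter

# (movie id, word form, occurrence) rows as read from the Bag of Words table.
bow_rows = [
    (31, "aaron", 1),
    (67, "murder", 1),
    (1134, "roll", 3),
    (1134, "wheel", 2),
    (1161, "zoom", 1),
]

# Word form -> synset id, as read from the Bag of Synsets table.
synset_id = {
    "aaron": 8844021,
    "murder": 6712568,
    "roll": 2051720,
    "wheel": 2051720,
    "zoom": 1618551,
}

def to_bos(rows, synset_of):
    """Re-key each row by (movie, synset) so that synonyms add up in one feature."""
    bos = Counter()
    for movie, word, count in rows:
        bos[(movie, synset_of[word])] += count
    return bos

bos = to_bos(bow_rows, synset_id)
```

Five word features become four synset features in this fragment, which is the mechanism behind the "synonyms represented by the same synsets" reduction.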



Semantic Profiles Evaluation

       Precision      Recall         F1             NDPM
Id     BOW    BOS     BOW    BOS     BOW    BOS     BOW    BOS
1      0.72   0.75    0.82   0.86    0.75   0.79    0.46   0.44
2      0.65   0.64    0.66   0.66    0.64   0.63    0.34   0.38
3      0.77   0.85    0.79   0.86    0.77   0.84    0.46   0.48
4      0.92   0.94    0.94   0.96    0.93   0.94    0.45   0.43
5      0.66   0.69    0.72   0.75    0.67   0.70    0.44   0.46
6      0.78   0.79    0.84   0.87    0.80   0.81    0.45   0.45
7      0.68   0.74    0.75   0.84    0.69   0.77    0.41   0.40
8      0.64   0.69    0.74   0.82    0.67   0.73    0.42   0.44
9      0.73   0.76    0.79   0.81    0.74   0.77    0.48   0.48
10     0.74   0.75    0.85   0.84    0.77   0.78    0.45   0.44
Mean   0.73   0.76    0.78   0.83    0.74   0.84    0.44   0.44
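For reference, the classification measures in this table can be computed for a single user as sketched below, using the "relevant if rating > 2" rule from the experimental setup. The function name and the toy ratings are illustrative, and how per-user scores were averaged into the per-genre figures is not stated on the slides.

```python
def evaluate(ratings, predicted_relevant, threshold=2):
    """Precision, recall and F1 of a predicted set against rating-based relevance."""
    relevant = {movie for movie, rating in ratings.items() if rating > threshold}
    hits = len(relevant & predicted_relevant)
    precision = hits / len(predicted_relevant) if predicted_relevant else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy user: two relevant movies (rating > 2), one of which is recommended.
toy_ratings = {"a": 5, "b": 3, "c": 1, "d": 0}
precision, recall, f1 = evaluate(toy_ratings, {"a", "c"})
```

NDPM is a ranking measure and is not covered by this sketch.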



Results

Improvement in precision (+3%) and recall (+5%)
The BOS model outperforms the BOW model specifically on datasets:
  3 (+8% precision, +7% recall)
  7 (+6% precision, +9% recall)
  8 (+5% precision, +8% recall)
No improvement on dataset 2 (Animation)
  Low number of rated movies
  WSD errors (difficulty in disambiguating stories)
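The gains quoted above can be re-derived from the precision and recall columns of the evaluation table. The values below are read off that table; the five-point cut-off used to single out datasets is an illustrative choice, not a criterion stated on the slide.

```python
# dataset id -> (precision BOW, precision BOS, recall BOW, recall BOS)
SCORES = {
    1: (0.72, 0.75, 0.82, 0.86),
    2: (0.65, 0.64, 0.66, 0.66),
    3: (0.77, 0.85, 0.79, 0.86),
    4: (0.92, 0.94, 0.94, 0.96),
    5: (0.66, 0.69, 0.72, 0.75),
    6: (0.78, 0.79, 0.84, 0.87),
    7: (0.68, 0.74, 0.75, 0.84),
    8: (0.64, 0.69, 0.74, 0.82),
    9: (0.73, 0.76, 0.79, 0.81),
    10: (0.74, 0.75, 0.85, 0.84),
}
MEAN = (0.73, 0.76, 0.78, 0.83)

def gain(row):
    """BOS minus BOW, in percentage points, for precision and recall."""
    p_bow, p_bos, r_bow, r_bos = row
    return round((p_bos - p_bow) * 100), round((r_bos - r_bow) * 100)

gains = {dataset: gain(row) for dataset, row in SCORES.items()}
mean_gain = gain(MEAN)

# Datasets where both measures improve by at least five points (illustrative cut-off).
standouts = sorted(d for d, (p, r) in gains.items() if p >= 5 and r >= 5)
```

With this cut-off the standouts are exactly the three datasets listed above, and dataset 2 is the only one where precision does not improve.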



Conclusions & Future Work

Extending BOW to BOS improves classification accuracy when WSD is performed on short documents
Improved results are independent of the distribution of positive and negative examples in the dataset

Integration of user profiles into UUCM [Mehta et al. 2005]
Ontologies and user profiles
Domain-specific ontologies
