J Data Semant
DOI 10.1007/s13740-016-0060-9
ORIGINAL ARTICLE
Content-Based Video Recommendation System Based on Stylistic Visual Features

Yashar Deldjoo1 · Mehdi Elahi1 · Paolo Cremonesi1 · Franca Garzotto1 · Pietro Piazzolla1 · Massimo Quadrana1
Received: 10 November 2015 / Accepted: 20 January 2016
© Springer-Verlag Berlin Heidelberg 2016
Abstract This paper investigates the use of automatically extracted visual features of videos in the context of recommender systems and brings some novel contributions to the domain of video recommendation. We propose a new content-based recommender system that encompasses a technique to automatically analyze video contents and to extract a set of representative stylistic features (lighting, color, and motion) grounded on existing approaches of Applied Media Theory. The evaluation of the proposed recommendations, assessed w.r.t. relevance metrics (e.g., recall) and compared with existing content-based recommender systems that exploit explicit features such as movie genre, shows that our technique leads to more accurate recommendations. Our proposed technique achieves better results not only when visual features are extracted from full-length videos, but also when the feature extraction operates on movie trailers, pinpointing that our approach is effective also when full-length videos are not available or when there are performance requirements. Our recommender can be used in combination with more traditional content-based recommendation techniques that exploit explicit content features associated with video files, to improve the accuracy of recommendations. Our recommender can also be used alone, to address the problem originating from video files that have no meta-data, a typical situation on popular movie-sharing websites (e.g., YouTube), where hundreds of millions of hours of video are uploaded by users every day and may contain no associated information. As they lack explicit content, these items cannot be considered for recommendation purposes by conventional content-based techniques even when they could be relevant for the user.

B Yashar Deldjoo

1 Politecnico di Milano, Milan, Italy
1 Introduction
Recommender Systems (RSs) are characterized by the capability of filtering large information spaces and selecting the items that are likely to be more interesting and attractive to a user [1]. Recommendation methods are usually classified into collaborative filtering methods, content-based methods and hybrid methods [1–4]. Content-based methods, which are among the popular ones [5–7], suggest items whose content characteristics are similar to those of items a user liked in the past. For example, news recommendations consider words or terms in articles to find similarities.

A prerequisite for content-based filtering is the availability of information about relevant content features of the items. In most existing systems, such features are associated with the items as structured or un-structured meta-information. Many RSs in the movie domain, for instance, consider movie genre, director, and cast (structured information), or plot, tags and textual reviews (un-structured information). In contrast, our work exploits "implicit" content characteristics of items, i.e., features that are "encapsulated" in the items and must be computationally "extracted" from them.
123
Journal: 13740 MS: 0060 TYPESET DISK LE CP Disp.:2016/2/5 Pages: 15 Layout: Large
We focus on the domain of video recommendations and propose a novel content-based technique that filters items according to visual features extracted automatically from video files, either full-length videos or trailers. Such features include lighting, color, and motion; they have a "stylistic" nature and, according to Applied Media Aesthetics [8], can be used to convey communication effects and to stimulate different feelings in the viewers.

The proposed recommendation technique has been evaluated w.r.t. relevance metrics (e.g., recall), using conventional techniques for the off-line evaluation of recommender systems that exploit machine learning methods [9,10]. The results have then been compared with existing content-based techniques that exploit explicit features such as movie genre. We consider three different experimental conditions—(a) visual features extracted from movie trailers; (b) visual features extracted from full-length videos; and (c) traditional explicit features based on genre—to test two hypotheses:

1. Our recommendation algorithm based on visual features leads to a higher recommendation accuracy in comparison with conventional genre-based recommender systems.

2. Accuracy is comparably high whether the stylistic features are extracted from full-length movies or from movie trailers only. In other words, for our recommender, movie trailers are good representatives of their corresponding full-length movies.

The evaluation study has confirmed both hypotheses and has shown that our technique leads to more accurate recommendations than the baseline techniques in both experimental conditions.
Our work provides a number of contributions to the RS field in the video domain. It improves our understanding of the role of implicit visual features in the recommendation process, a subject which has been addressed by a limited number of studies. The proposed technique can be used in two ways:

• "In combination with" other content-based techniques that exploit explicit content, to improve their accuracy. This mixed approach has been investigated and evaluated by other works [11,12]. Still, prior off-line evaluations have involved a limited number of users (a few dozen) against the thousands employed in our study.

• "Autonomously", to replace traditional content-based approaches when (some) video items (typically the new ones) are not equipped with the explicit content features that a conventional recommender would employ to generate relevant recommendations. This situation, which hereinafter we refer to as the "extreme new item problem" [13], typically occurs for example on popular movie-sharing websites (e.g., YouTube), where hundreds of millions of hours of video are uploaded by users every day and may contain no meta-data. Conventional content-based techniques would neglect to consider these new items even if they may be relevant for recommendation purposes, as the recommender has no content to analyze but video files. To our knowledge, the generation of recommendations that exploit automatically extracted visual features "only" has not been explored nor evaluated in prior works.

As an additional contribution, our study pinpoints that our technique is accurate when visual feature extraction operates both on full-length movies (which is a computationally demanding process) and on movie trailers. Hence, our method can be used effectively also when full-length videos are not available or when it is important to improve performance.
The rest of the paper is organized as follows. Section 2 reviews the relevant state of the art related to content-based recommender systems and video recommender systems. This section also introduces some theoretical background on Media Aesthetics that helps us to motivate our approach and interpret the results of our study. Section 3 describes the possible relation between the visual features adopted in our work and the esthetic variables that are well known to artists in the domain of movie making. In Sect. 4, we describe our method for extracting and representing implicit visual features of the video and provide the details of our recommendation algorithm. Section 5 introduces the evaluation method. Section 6 presents the results of the study and Sect. 7 discusses them. Section 8 draws the conclusions and identifies open issues and directions for future work.
2 Related Work

2.1 Content-Based Recommender Systems
Content-based RSs create a profile of a user's preferences, interests and tastes by considering the feedback provided by the user on some items together with the content associated with them. Feedback can be gathered either explicitly, by asking users to rate an item [14], or implicitly, by analyzing their activity [15]. Recommendations are then generated by matching the user profile against the features of all items. Content can be represented using keyword-based models, in which the recommender creates a Vector Space Model (VSM) representation of item features, where an item is represented by a vector in a multi-dimensional space. These dimensions represent the features used to describe the items. By means of this representation, the system measures a relevance score that represents the user's degree of interest
toward any of these items [6]. For instance, in the movie domain, the features that describe an item can be genre, actors, or director [16]. This model may allow content-based recommender systems to naturally tackle the new item problem [13]. Other families of content-based RSs use semantic analysis (lexicons and ontologies) to create more accurate item representations [17–19].

In the literature, a variety of content-based recommendation algorithms have been proposed. A traditional example is the "k-nearest neighbor" (KNN) approach, which computes the preference of a user for an unknown item by comparing it against all the items known by the user in the catalog. Every known item contributes to the predicted preference score according to its similarity with the unknown item. The similarity is typically measured using the cosine similarity [6,7]. There are also works that model the probability of the user being interested in an item using a Bayesian approach [20], or that use other techniques adopted from Information Retrieval (IR), such as the Relevance Feedback method [21].
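To make the KNN scheme concrete, the following minimal sketch (our illustration, not the authors' implementation; the item profiles and rating values are invented) predicts a preference score for an unknown item as a similarity-weighted average of the user's known ratings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def predict_score(target, rated):
    """KNN content-based prediction: every item the user already rated
    contributes its rating, weighted by its similarity to the target."""
    sims = [(cosine(feats, target), r) for feats, r in rated]
    norm = sum(abs(s) for s, _ in sims)
    return sum(s * r for s, r in sims) / norm if norm else 0.0

# Hypothetical item profiles (e.g., genre or visual-feature vectors)
rated = [([1.0, 0.0, 1.0], 5.0),   # an item the user liked
         ([0.0, 1.0, 0.0], 1.0)]   # an item the user disliked
unknown = [1.0, 0.0, 0.8]          # close to the liked item
score = predict_score(unknown, rated)
```

The unknown item is nearly parallel to the liked profile and orthogonal to the disliked one, so the predicted score lands near the high rating.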
2.2 Recommender Systems in the Multimedia Domain
In the multimedia domain, recommender systems typically exploit two types of item features, hereinafter referred to as High-Level features (HL) or Low-Level features (LL). High-Level features express properties of media content that are obtained from structured sources of meta-information such as databases, lexicons and ontologies, or from less structured data such as reviews, news articles, item descriptions and social tags [16,20–24]. Low-Level features are extracted directly from the media files themselves [25,26]. In the music recommendation domain, for example, Low-Level features are acoustic properties such as rhythm or timbre, which are exploited to find music tracks that are similar to those liked by a user [27–30].
In the domain of video recommendation, a limited number of works have investigated the use of Low-Level features extracted from purely visual content, which typically represent stylistic aspects of the videos [11,12,31,32]. Still, existing approaches consider only scenarios where Low-Level features are exploited in addition to another type of information, with the purpose of improving the quality of recommendations. Yang et al. [11] propose a video recommender system, called VideoReach, which uses a combination of High-Level and Low-Level video features of different nature—textual, visual and aural—to improve the click-through rate. Zhao et al. [12] propose a multi-task learning algorithm to integrate multiple ranking lists, generated using different sources of data, including visual content. As none of these works use Low-Level visual features only, they cannot be applied when the extreme new item problem [33] occurs, i.e., when only video files are available and high-level information is missing.
2.3 Video Retrieval
A Content-Based Recommender System (CBRS) for videos is similar to a Content-Based Video Retrieval (CBVR) system in the sense that both analyze video content to search for digital videos in large video databases. However, there are major differences between these systems. For example, people use popular video-sharing websites such as YouTube for three main purposes [34]: (1) direct navigation: to watch videos that they found at specific websites; (2) search: to watch videos around a specific topic expressed by a set of keywords; (3) personal entertainment: to be entertained by content that matches their taste. A CBVR system is composed of a set of techniques that typically address the first and the second goal, while a CBRS focuses on the third goal. Accordingly, the main differences between CBRS and CBVR can be listed as follows [11]:

1. Different objectives: the goal of a CBVR system is to search for videos that match "a given query" provided directly by a user, e.g., as text or as a video clip. The goal of a video CBRS is, instead, to search for videos that match the "user taste" (also known as the user profile), which can be obtained by analyzing the user's past behavior and opinions on different videos.

2. Different inputs: the input to a CBVR system typically consists of a set of keywords or a video query, and these inputs can be entirely un-structured, with no properties per se. The input to a video CBRS, on top of the video content, includes some or many features obtained from user modeling (user profile, tasks, activities), the context (location, time, group) and other sources of information.

3. Different features: in general, video content features can be classified into three rough hierarchical levels [35]:

• Level 1: the stylistic low level, which deals with modeling the visual styles in a video.
• Level 2: the syntactic level, which deals with finding objects and their interactions in a video.
• Level 3: the semantic level, which deals with conceptual modeling of a video.

People most often rely on content features derived from levels 2 and 3 to search for videos, as these reside closer to human understanding. Even for recommender systems, most CBRSs use video meta-data (genre, actors, etc.) that reside at the higher syntactic and semantic levels to provide recommendations. One of the novelties of this work is to explore the importance of stylistic low-level features in the human perception of movies. Movie directors deliberately exploit human perception at every stage of movie creation, to convey emotions and feelings to the viewers. We thus conclude that CBVR systems and CBRSs deal with
video content modeling at different levels, depending on the suitability of the feature for a particular application.

While low-level features have been only marginally explored in the recommender systems community, they have been extensively studied in other fields such as Computer Vision and Content-Based Video Retrieval [36–38]. Albeit with different objectives, these communities share with the recommender systems community the research problems of defining the "best" representation of video content and of classifying videos according to features of different nature. Hence, they offer results and insights that are of interest also in the movie recommender systems context. Hu et al. [36] and Brezeale and Cook [39] provide comprehensive surveys of the relevant state of the art related to video content analysis and classification, and discuss a large body of low-level features (visual, auditory or textual) that can be considered for these purposes. Rasheed et al. [38] propose a practical movie genre classification scheme based on computable visual cues. Rasheed and Shah [40] discuss a similar approach that also considers audio features. Finally, Zhou et al. [41] propose a framework for automatic classification, using temporally structured features, based on an intermediate level of scene representation.
2.4 Video Features from a Semiotic Perspective
The stylistic visual features of videos that we exploit in our recommendation algorithm have been studied not only in Computer Science but also from a semiotic and expressive point of view, in the theory and practice of movie making (see Sect. 3). Lighting, color, and camera motion are important elements that movie makers consider in their work to convey meanings, or to achieve intended emotional, esthetic, or informative effects. Applied Media Aesthetics [8] is explicitly concerned with the relation of media features (e.g., lights, shadows, colors, space representation, camera motion, or sound) to the perceptual, cognitive, and emotional reactions they are able to evoke in media consumers, and tries to identify patterns in how such features operate to produce the desired effect [42]. Some aspects of these patterns ([38,43]) can be computed from video data streams as statistical values and can be used to computationally identify correlations with the user profile, in terms of the perceptual and emotional effects that users like.
3 Artistic Background
In this section, we provide the artistic background to the idea of using stylistic visual features for movie recommendation. We describe the stylistic visual features from an artistic point of view and explain the possible relation between these low-level features and the esthetic variables that are well known to artists in the domain of movie making.

As noted briefly in Sect. 2, the study of how various esthetic variables and their combinations contribute to establishing the meaning conveyed by an artistic work is the domain of different disciplines, e.g., semiotics, traditional esthetic studies, etc. The shared belief is that humans respond to certain stimuli (whether they are called signs, symbols or features, depending on the discipline) in ways that are predictable, up to a given extent. One of the consequences of this is that similar stimuli are expected to provoke similar reactions, and this in turn may allow grouping similar works of art together by the reaction they are expected to provoke.
Among these disciplines, of particular interest for this paper is Applied Media Aesthetics [8], which is concerned with the relation of a number of media elements, such as light, camera movements and colors, to the perceptual reactions they are able to evoke in consumers of media communication, mainly videos and films. Such media elements, which together build the visual images composing the media, are investigated following a rather formalistic approach that suits the purposes of this paper. By analyzing cameras, lenses, lighting, etc., as production tools, as well as their esthetic characteristics and uses, Applied Media Aesthetics tries to identify patterns in how such elements operate to produce the desired effect in communicating emotions and meanings.

The image elements that are usually addressed as fundamental in the literature, e.g., in [42], even if with slight differences due to the specific context, are lights and shadows, colors, space representation, motion and sound. It has been shown, e.g., in [38,43], that some aspects of these elements can be computed from the video data stream as statistical values. We call these computable aspects features.

We will now look in closer detail at the features investigated for content-based video recommendation in this paper, to provide a solid overview of how they are used to produce perceptual reactions in the audience. Sound will not be further discussed, since it is out of the scope of this work, as is space representation, which concerns, e.g., the different shooting angles that can be used to represent an event dramatically.
3.1 Lighting
There are at least two different purposes for lighting in videos and movies. The most obvious is to allow and define viewers' perception of the environment, to make visible the objects and places they look at. But light can also manipulate how an event is perceived emotionally, acting in a way that bypasses rational screens. The two main lighting alternatives are usually referred to as chiaroscuro and flat lighting, but there are many intermediate solutions between them. The
Fig. 1 a Out of the Past (1947), an example of highly contrasted lighting. b The Wizard of Oz (1939), a flat-lighting example

Fig. 2 a An image from Django Unchained (2012): the red hue is used to increase the scene's sense of violence. b An image from Lincoln (2012): a blue tone is used to produce the sense of coldness and fatigue experienced by the characters
former is a lighting technique characterized by a high contrast between light and shadow areas that puts the emphasis on an unnatural effect: the borders of objects are altered by the lights. The latter is instead an almost neutral, realistic way of illuminating, whose purpose is to enable recognition of stage objects. Figure 1a, b exemplifies these two alternatives.
3.2 Colors
The expressive quality of colors is closely related to that of lighting, sharing the same ability to set or magnify the feeling derived from a given situation. The problem with colors is the difficulty of isolating their contribution to the overall 'mood' of a scene from that of the other esthetic variables operating in the same context. Usually their effectiveness is higher when the context as a whole is predisposed toward a specific emotional objective.

Even if an exact correlation between colors and the feelings they may evoke is not currently supported by enough scientific data, colors nonetheless have an expressive impact that has been investigated thoroughly, e.g., in [44]. An interesting metric to quantify this impact has been proposed in [45] as perceived color energy, a quantity that depends on a color's saturation, brightness and the size of the area the color covers in an image. Hue also plays a role: if it tends toward the reds, the quantity of energy is higher, while if it tends toward the blues, it is lower. These tendencies are shown in the examples of Fig. 2a, b.
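The exact formula from [45] is not reproduced in this paper; the sketch below is only our loose illustration of the idea that energy grows with saturation, brightness and covered area, with a hue weight favoring reds over blues (the function name and the cosine-based hue weight are our assumptions, not the published metric):

```python
import math

def color_energy(saturation, brightness, area_fraction, hue_deg):
    """Illustrative stand-in for perceived color energy (NOT the metric
    of [45]): product of saturation, brightness and area, boosted for
    reddish hues (near 0/360 degrees) and damped for bluish ones (near 240)."""
    hue_weight = 1.0 + 0.5 * math.cos(math.radians(hue_deg))
    return saturation * brightness * area_fraction * hue_weight
```

Under this toy weighting, a fully saturated red region carries twice the energy of an equally large and bright blue one, matching the tendency described above.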
3.3 Motion
The illusion of movement given by screening a sequence of still frames in rapid succession is the very reason for cinema's existence. In a video or movie, there are different types of motion to consider:

• Profilmic movements: every movement that concerns elements shot by the camera falls into this category, e.g., performers' motion, vehicles, etc. The movement can be real or perceived. By deciding the type and quantity of motion an 'actor' has, considering as actor any possible protagonist of a scene, the director defines, among other things, the level of attention to, or expectations from, the scene. Examples are the hero walking slowly in a dark alley, or a fast car chase.

• Camera movements: these are the movements that alter the point of view on the narrated events. Camera movements, such as the pan, truck, pedestal, dolly, etc., can be used for different purposes. Some uses are descriptive, to introduce landscapes or actors or to follow performers' actions; others concern the narration, to relate two or more
different elements, e.g., anticipating a car's route to show an unseen obstacle, or to move toward or away from events.

• Sequence movements: as shots change, using cuts or other transitions, the rhythm of the movie changes accordingly. Generally, a faster rhythm is associated with excitement, while a slower rhythm suggests a more relaxed pace [46].

In this paper, we followed the approach in [38], considering the motion content of a scene as a feature that aggregates and generalizes both profilmic and camera movements. Sequence movements are instead captured by the average shot length feature; both are described in detail in the next section.
4 Method Description
The first step in building a video CBRS based on stylistic low-level features is to search for features that comply with human visual norms of perception and abide by the grammar of the film—the rules that movie producers and directors use to make movies. In general, a movie M can be represented as a combination of three main modalities: the visual modality MV, the audio modality MA and the textual modality MT. The focus of this work is only on visual features, therefore M = M(MV). The visual modality itself can be represented as

MV = MV(fv) (1)

where fv = (f1, f2, . . . , fn) is a set of features that describe the visual content of a video. Generally speaking, a video can be considered a contiguous sequence of many frames. Consecutive video frames are often highly similar and correlated. Considering all of these frames for feature extraction not only provides no new information to the system but is also computationally inefficient. Therefore, the first step prior to feature extraction is the structural analysis of the video, i.e., detecting shot boundaries and extracting a key frame within each shot. A shot boundary is a frame whose surrounding frames differ significantly in their visual content. Frames within a shot, on the other hand, are highly similar; it therefore makes sense to take one representative frame per shot and use that frame for feature extraction. This frame is called the Key Frame.
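The paper does not specify the shot-boundary algorithm; a common, minimal approach, sketched here under that assumption, is to threshold the distance between gray-level histograms of consecutive frames and take the middle frame of each shot as the key frame (the function names, the threshold and the toy video are ours):

```python
import numpy as np

def shot_boundaries(frames, threshold=0.4, bins=32):
    """Detect shot boundaries by thresholding the normalized L1 distance
    between gray-level histograms of consecutive frames.
    frames: list of 2-D numpy arrays (gray-scale images in [0, 255])."""
    cuts = []
    prev = None
    for t, frame in enumerate(frames):
        hist, _ = np.histogram(frame, bins=bins, range=(0, 255))
        hist = hist / hist.sum()
        if prev is not None and np.abs(hist - prev).sum() / 2 > threshold:
            cuts.append(t)          # frame t starts a new shot
        prev = hist
    return cuts

def key_frames(frames, cuts):
    """Pick the middle frame of each shot as its key frame."""
    starts = [0] + cuts
    ends = cuts + [len(frames)]
    return [(s + e - 1) // 2 for s, e in zip(starts, ends)]

# Toy video: 6 dark frames followed by 6 bright frames -> one cut at t = 6
video = [np.full((8, 8), 20.0)] * 6 + [np.full((8, 8), 230.0)] * 6
cuts = shot_boundaries(video)
kfs = key_frames(video, cuts)
```

With the boundaries in hand, the average shot length of Eq. (3) follows directly as len(frames) divided by the number of detected shots.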
Figure 3 illustrates the hierarchical representation of a video. Two types of features are extracted from videos: (1) temporal features and (2) spatial features. The temporal features reflect the dynamic perspectives of a video, such as the average shot duration (or shot length) and object motion, whereas the spatial features capture static properties such as color, light, etc. In the following we describe these features in more detail, together with the rationale behind choosing them and how they can be measured in a video.

Fig. 3 Hierarchical video representation and feature extraction in our framework
4.1 Visual Features
To demonstrate the effectiveness of the proposed video CBRS, after carefully studying the computer vision literature, we selected the five most informative and distinctive features to be extracted from each video:

fv = (f1, f2, f3, f4, f5) = (Lsh, µcv, µm, µσ²m, µlk) (2)

where Lsh is the average shot length, µcv is the mean color variance over key frames, µm and µσ²m are the mean motion average and mean motion standard deviation across all frames, respectively, and µlk is the mean lighting key over key frames. As can be noted, some of the features are calculated across key frames and the others across all video frames (see Fig. 3). Each of these features carries a meaning and is used in the hands of able directors to convey emotions when shooting movies. Assuming that there are n_f frames in the video, with t the index of a single frame, and n_sh key frames (or shots), with q the index in a numbered list of key frames, the proposed visual features and the way they are calculated are presented in the following [25,26,38]:
• Average shot length (Lsh): a shot is a single camera action, and the number of shots in a video provides useful information about the pace at which the movie was created. The average shot length is defined as

Lsh = n_f / n_sh (3)

where n_f is the number of frames and n_sh the number of shots in the movie. For example, action movies usually contain rapid movements of the camera (therefore, a higher number of shots, i.e., shorter shot
lengths) compared to dramas, which often contain conversations between people (thus a longer average shot length). Because movies can be made at different frame rates, Lsh is further normalized by the frame rate of the movie.
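The text does not state the exact form of the frame-rate normalization; a natural reading, assumed in this sketch, is to divide Eq. (3) by the frame rate, which turns the average shot length into a duration in seconds:

```python
def average_shot_length(n_frames, n_shots, fps):
    """Eq. (3) plus the frame-rate normalization mentioned in the text
    (our assumption: L_sh = (n_f / n_sh) / fps, i.e., the mean shot
    duration in seconds, comparable across movies with different fps)."""
    return n_frames / n_shots / fps

# e.g., 4320 frames split into 60 shots at 24 fps -> 3-second average shot
lsh = average_shot_length(4320, 60, 24)
```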
• Color variance (µcv): the variance of color has a strong correlation with the genre. For instance, directors tend to use a large variety of bright colors for comedies and darker hues for horror films. For each key frame, represented in the Luv color space, we compute the covariance matrix

ρ = | σ²L   σ²Lu  σ²Lv |
    | σ²Lu  σ²u   σ²uv |
    | σ²Lv  σ²uv  σ²v  |   (4)

The generalized variance can be used as the representative of the color variance in each key frame, given by

Φq = det(ρ) (5)

in which a key frame is a representative frame within a shot (e.g., its middle frame). The average color variance is then calculated as

µcv = (Σ_{q=1}^{n_sh} Φq) / n_sh (6)

where n_sh is the number of shots, equal to the number of key frames.
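Assuming the key frames are available as arrays of Luv pixel values, Eqs. (4)–(6) map directly onto numpy's covariance and determinant routines (the helper names and the toy frames are ours):

```python
import numpy as np

def key_frame_color_variance(key_frame_luv):
    """Eqs. (4)-(5): generalized color variance of one key frame.
    key_frame_luv: (H, W, 3) array of L, u, v pixel values."""
    pixels = key_frame_luv.reshape(-1, 3)     # one row per pixel
    rho = np.cov(pixels, rowvar=False)        # 3x3 covariance matrix, Eq. (4)
    return float(np.linalg.det(rho))          # generalized variance, Eq. (5)

def mean_color_variance(key_frames_luv):
    """Eq. (6): average the generalized variance over all key frames."""
    return float(np.mean([key_frame_color_variance(kf)
                          for kf in key_frames_luv]))

# Hypothetical key frames: random colors give a large generalized variance,
# a flat single-color frame gives a determinant of zero.
rng = np.random.default_rng(0)
colorful = rng.uniform(0, 255, size=(16, 16, 3))
flat = np.full((16, 16, 3), 128.0)
mu_cv = mean_color_variance([colorful, flat])
```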
• Motion within a video can be caused mainly by camera movement (i.e., camera motion) or by movements of the objects being filmed (i.e., object motion). While the average shot length captures the former characteristic of a movie, it is desirable for the motion feature to also capture the latter. A motion feature descriptor based on optical flow [47,48] is used to obtain a robust estimate of the motion in a sequence of images from the pixel velocities. Because motion features are based upon sequences of frames, they are calculated across all video frames.
At frame $t$, let the average motion of pixels be $m_t$ and the variance of pixel motions be $(\sigma_m^2)_t$. Then

$$\mu_m = \frac{\sum_{t=1}^{n_f} m_t}{n_f} \qquad (7)$$

and

$$\mu_{\sigma_m^2} = \frac{\sum_{t=1}^{n_f} (\sigma_m^2)_t}{n_f} \qquad (8)$$

where $\mu_m$ and $\mu_{\sigma_m^2}$ represent the averages of the motion mean and motion variance aggregated over all $n_f$ frames.
• Lighting key is another distinguishing factor between movie genres: directors use it to control the type of emotion they want to induce in the viewer. For example, comedy movies often adopt a lighting key with an abundance of light (i.e., a high gray-scale mean) and well-spread gray-scale values (i.e., a high gray-scale standard deviation). This trend is often known as high-key lighting. On the other hand, horror movies or noir films often pick gray-scale distributions that are low in both gray-scale mean and gray-scale standard deviation, known as low-key lighting. To capture both of these parameters, after transforming all key frames to the HSV color space [49], we compute the mean µ and standard deviation σ of the value component, which corresponds to the brightness. The scene lighting key ξ, defined as the product of µ and σ, is used to measure the lighting of key frames:
$$\xi_q = \mu \cdot \sigma \qquad (9)$$
For instance, comedies often contain key frames with a well-distributed gray-scale histogram, so that both the mean and the standard deviation of the gray-scale values are high; therefore, for the comedy genre one can state ξ > τc. For horror movies, with their poorly distributed lighting, the situation is reversed and we will have ξ < τh, where τc and τh are predefined thresholds. Other movie genres (e.g., drama) tend to fall in the intermediate range τh < ξ < τc, where it is hard to use the above factor to distinguish them. The average lighting key calculated over the key frames is given by

$$\mu_{lk} = \frac{\sum_{q=1}^{n_{sh}} \xi_q}{n_{sh}} \qquad (10)$$
It is worth noting that the stylistic visual features have been extracted using our own implementation. The code and the dataset of extracted features will be publicly accessible through the webpage of the group.1
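The extraction pipeline described above can be sketched in a few lines of Python. This is an illustrative approximation, not our MATLAB implementation: shot boundaries are assumed to be already detected, the motion statistics use plain frame differencing as a crude stand-in for the optical-flow descriptor of [47,48], and the color-variance function operates on whatever three-channel representation it is given rather than performing an explicit Luv conversion.

```python
import numpy as np

def avg_shot_length(n_frames, n_shots, fps):
    # Eq. (3), normalized by the frame rate of the movie.
    return (n_frames / n_shots) / fps

def color_variance(key_frame):
    # Eqs. (4)-(5): generalized variance = determinant of the 3x3
    # covariance matrix of the color channels over all pixels.
    pixels = key_frame.reshape(-1, 3).astype(float)
    rho = np.cov(pixels, rowvar=False)
    return float(np.linalg.det(rho))

def motion_stats(frames):
    # Eqs. (7)-(8), with per-pixel temporal differences as a crude
    # proxy for optical-flow magnitudes (an assumption, see above).
    diffs = [np.abs(b - a) for a, b in zip(frames, frames[1:])]
    m_t = [d.mean() for d in diffs]   # average motion per frame
    v_t = [d.var() for d in diffs]    # variance of pixel motion per frame
    return float(np.mean(m_t)), float(np.mean(v_t))

def lighting_key(value_channel):
    # Eq. (9): xi = mu * sigma of the brightness (V) channel
    # of a key frame in HSV space.
    v = value_channel.astype(float)
    return float(v.mean() * v.std())
```

For example, a 240-frame movie with 10 shots at 24 fps yields a normalized average shot length of 1.0, and a uniformly bright key frame yields a lighting key of 0 (zero standard deviation), i.e., the degenerate low-contrast limit.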
4.2 Recommendation Algorithm

To generate recommendations using our low-level stylistic visual features, we adopted a classical "k-nearest neighbor" content-based algorithm. Given a set of users u ∈ U and a catalog of items i ∈ I, a set of preference scores r_ui given by user u to item i has been collected. Moreover, each item i ∈ I is associated with its feature vector f_i. For each couple
1 http://recsys.deib.polimi.it/.
of items i and j, the similarity score s_ij is computed using cosine similarity:

$$s_{ij} = \frac{f_i^T f_j}{\| f_i \| \, \| f_j \|} \qquad (11)$$
For each item i the set of its nearest neighbors $NN_i$ is built, with $|NN_i| < K$. Then, for each user u ∈ U, the predicted preference score r_ui for an unseen item i is computed as follows:

$$r_{ui} = \frac{\sum_{j \in NN_i,\, r_{uj} > 0} r_{uj}\, s_{ij}}{\sum_{j \in NN_i,\, r_{uj} > 0} s_{ij}} \qquad (12)$$
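Equations (11) and (12) can be condensed into a minimal Python sketch (a toy three-item catalog with hypothetical feature vectors and ratings, not our actual implementation):

```python
import numpy as np

def cosine_sim(F):
    # Eq. (11): pairwise cosine similarity between item feature vectors.
    norms = np.linalg.norm(F, axis=1, keepdims=True)
    return (F @ F.T) / (norms * norms.T)

def predict(u_ratings, i, S, K=2):
    # Eq. (12): similarity-weighted average of the user's positive
    # ratings (r_uj > 0) over the K nearest neighbors of item i.
    sims = S[i].copy()
    sims[i] = -np.inf                    # exclude the item itself
    nn = np.argsort(sims)[::-1][:K]      # K most similar items
    rated = [j for j in nn if u_ratings[j] > 0]
    if not rated:
        return 0.0
    num = sum(u_ratings[j] * S[i, j] for j in rated)
    den = sum(S[i, j] for j in rated)
    return num / den

# Hypothetical toy data: three items with two-dimensional features f_i.
F = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
S = cosine_sim(F)
r_u = np.array([5.0, 0.0, 1.0])    # item 1 is unseen by the user
print(predict(r_u, 1, S))          # -> 5.0
```

Here the unseen item 1 has features identical to item 0, so its predicted score collapses to the user's rating for item 0; the orthogonal item 2 has zero similarity and contributes nothing.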
5 Evaluation Methodology

We have formulated the following two hypotheses:

1. the content-based recommender system that exploits a set of representative visual features of the video contents may lead to higher recommendation accuracy in comparison to the genre-based recommender system;
2. the trailers of the movies can be representative of their full-length movies with respect to the stylistic visual features, and exhibit high correlation with them.

Hence, we speculate that a set of stylistic visual features, extracted automatically, may be more informative of the video content than a set of high-level expert-annotated features.
To test these hypotheses, we have evaluated the top-N recommendation quality of each content-based recommender system by running a fivefold cross-validation on a subset of the MovieLens-20M dataset [50]. The details of the subset are described later in the paper. The details of the evaluation procedure follow.

First, we generated five disjoint random splits of the ratings in the dataset. Within each iteration, the evaluation procedure closely resembles the one described in [9]. For each iteration, one split was used as the probe set Pi, while the remaining ones were used to generate the training set Mi, which was used to train the recommendation algorithm. The test set Ti contains only 4-star and 5-star ratings from Pi, which we assume to be relevant.
For each relevant item i rated by user u in Ti, we form a list containing the item i and all the items not rated by the user u, which we assume to be irrelevant to her. Then, we form a top-N recommendation list by picking the top N ranked items from the list. Denoting by r the rank of i, we have a hit if r < N; otherwise we have a miss. Since we have one single relevant item per test case, recall can assume value 0
Table 1 General information about our dataset

# Items      167
# Users      139,190
# Ratings    570,816
Table 2 Distribution of movies in our catalog

     Action  Comedy  Drama  Horror  Mixed  Total
#    29      27      25     24      62     167
%    17      16      15     14      38     100
or 1 (respectively, in case of a miss or a hit). Therefore, the recall(N) on the test set Ti can be easily computed as:

$$\mathrm{recall}(N) = \frac{\#\mathrm{hits}}{|T_i|} \qquad (13)$$

We could have also evaluated the precision(N) on the recommendation list, but since it is related to the value of the recall(N) by a simple scaling factor 1/N [9], we decided to omit it to avoid redundancy. The values reported throughout the paper are the averages over the five folds.
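The hit/miss protocol just described can be sketched compactly (with hypothetical toy scores; in the actual evaluation the candidate list contains all items unrated by the user):

```python
def recall_at_n(test_cases, N):
    # Each test case pairs the score of the single relevant item with
    # the scores of the items the user has not rated (assumed irrelevant).
    hits = 0
    for rel_score, irrelevant_scores in test_cases:
        # rank r = number of irrelevant items scored above the relevant one
        r = sum(1 for s in irrelevant_scores if s > rel_score)
        if r < N:          # hit if the relevant item lands in the top N
            hits += 1
    return hits / len(test_cases)   # Eq. (13): #hits / |T_i|

cases = [
    (0.9, [0.8, 0.5, 0.3]),   # relevant item ranked first -> hit for any N
    (0.2, [0.8, 0.5, 0.3]),   # ranked fourth -> miss for N < 4
]
print(recall_at_n(cases, N=1))   # -> 0.5
```

Because each test case contributes a 0 or a 1, recall(N) is simply the fraction of relevant test items ranked within the top N.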
We have used a set of full-length movies and their trailers that were sampled randomly from all the main genres, i.e., Action, Comedy, Drama, and Horror. A summary of the dataset is given in Table 1.

As noted before, the movie titles were selected randomly from the MovieLens dataset, and the files were obtained from YouTube [51]. The dataset contained overall 167 movies, 105 of which belong to a single genre and 62 to multiple genres (see Table 2).
The proposed video feature extraction algorithm was implemented in MATLAB R2015b2 on a workstation with an Intel Xeon(R) eight-core 3.50 GHz processor and 32 GB RAM. The Image Processing Toolbox (IPT) and Computer Vision Toolbox (CVT) in MATLAB provide the basic elements for feature extraction and were used in our work for video content analysis. In addition, we used the R statistical computing language3 together with MATLAB for data analysis. For video classification, we took advantage of the infrastructure in Weka4, which provides an easy-to-use and standard framework for testing different classification algorithms.
2 http://www.mathworks.com/products/matlab.
3 https://www.r-project.org.
4 http://www.cs.waikato.ac.nz/ml/weka.
6 Results

6.1 Classification Accuracy

We have conducted a preliminary experiment to understand whether the genre of the movies can be explained in terms of the five low-level visual features described in Sect. 4. The goal of the experiment is to classify the movies into genres by exploiting their visual features.
6.1.1 Experiment A

To simplify the experiment, we have considered the 105 movies tagged with one genre only (Table 2). We have experimented with many classification algorithms and obtained the best results with decision tables [52]. Decision tables can be considered as tabular knowledge representations [53]. Using this technique, a new instance is classified by searching for exact matches in the decision table cells, and the instance is then assigned to the most frequent class among all training instances matching that table cell [54].
We have used tenfold cross-validation and obtained an accuracy of 76.2% for trailers and 70.5% for full-length movies. The best classification was obtained for the comedy genre: 23 out of 27 movie trailers were correctly classified. On the other hand, the most erroneous classification happened for the horror genre, where a larger number of movie trailers were misclassified into the other genres. For example, 4 out of 24 horror movie trailers were mistakenly classified as action. This was expected, since horror movies typically contain many action scenes, which may make the classification very hard. Similar results have been observed for full-length movies.

From this first experiment we can conclude that the five low-level stylistic visual features used in our experiment are informative of the movie content and can be used to accurately classify the movies into their corresponding genres.
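The lookup scheme just described can be illustrated with a self-contained sketch: features are discretized into equal-width bins, a table cell is the tuple of bin indices, and classification is an exact-match lookup followed by a majority vote. This is a simplified stand-in for Weka's DecisionTable (which additionally searches for a good feature subset); the two-feature training set and its labels are hypothetical.

```python
from collections import Counter, defaultdict

def build_table(X, y, bins=3):
    # Discretize each feature into equal-width bins; a table cell is
    # the tuple of bin indices of an instance.
    lo = [min(col) for col in zip(*X)]
    hi = [max(col) for col in zip(*X)]

    def cell(x):
        return tuple(
            min(max(int((v - l) / ((h - l) / bins or 1.0)), 0), bins - 1)
            for v, l, h in zip(x, lo, hi)
        )

    table = defaultdict(list)
    for x, label in zip(X, y):
        table[cell(x)].append(label)
    return table, cell

def classify(table, cell, x, default="unknown"):
    # Exact-match lookup in the table, then majority vote among the
    # training instances that fell into the same cell.
    labels = table.get(cell(x))
    if not labels:
        return default
    return Counter(labels).most_common(1)[0][0]

# Hypothetical two-feature training set (e.g., lighting key, motion).
X = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1]]
y = ["horror", "horror", "comedy"]
table, cell = build_table(X, y)
```

A query falling into the cell of the two "horror" instances inherits their majority label, while a query matching no populated cell returns the default.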
6.2 Correlation Between Full-Length Movies and Trailers

One of the research hypotheses we have formulated addresses the possible correlation between the full-length movies and their corresponding trailers. Indeed, we are interested in investigating whether or not the trailers are representative of their full-length movies with respect to the stylistic visual features. To investigate this issue, we have performed two experiments.
6.2.1 Experiment B

We have first extracted the low-level visual features from each of the 167 movies and their corresponding trailers in our dataset. We have then computed the cosine similarity between the visual features extracted from the full-length movies and the trailers. Cosine is the same metric used to generate recommendations in our experiments (as explained in Sect. 5), and hence it is a reliable indicator to evaluate whether recommendations based on trailers are similar to recommendations based on the full-length movies.

Figure 4 plots the histogram of the cosine similarity. The average is 0.78 and the median is 0.80. More than 75% of the movies have a cosine similarity greater than 0.7 between the full-length movie and the trailer. Moreover, less than 3% of the movies have a similarity below 0.5.

Overall, the cosine similarity shows a substantial correlation between the full-length movies and trailers. This is an interesting outcome, which basically indicates that the trailers of the movies can be considered good representatives of the corresponding full-length movies.
6.2.2 Experiment C

In the second experiment, we have used the low-level features extracted from both trailers and the full-length movies to feed the content-based recommender system described in Sect. 4.2. We have used features f3–f5 (i.e., camera motion, object motion, and light) as they proved to be the best choice of stylistic visual features (as described in Sect. 6.3). Quality of recommendations has been evaluated according to the methodology described in Sect. 5.
Fig. 4 Histogram distribution of the cosine similarity between full-length movies and trailers (x axis: cosine similarity, 0.4–1; y axis: probability)
Fig. 5 Performance comparison of different CB methods under the best feature combination for full-length movies (a) and trailers (b). K = 2 for LL features and K = 10 for HL features. (Both panels plot recall against N, for N = 1 to 5, for LL (stylistic), HL (genre), and random recommendations.)
Figure 5 plots the recall@N for full-length movies (a) and trailers (b), with values of N ranging from 1 to 5. We note that the K values have been determined with cross-validation (see Sect. 6.3.1). By comparing the two figures, it is clear that the recall values of the content-based recommendations using the features extracted from the full-length movies and from the trailers are almost identical.

The results of this second experiment confirm that low-level features extracted from trailers are representative of the corresponding full-length movies and can be effectively used to provide recommendations.
6.3 Recommendation Quality

In this section, we investigate our main research hypothesis: whether low-level visual features can be used to provide good-quality recommendations. We compare the quality of content-based recommendations based on three different types of features:

Low Level (LL): stylistic visual features.
High Level (HL): semantic features based on genres.
Hybrid (LL+HL): both stylistic and semantic features.
6.3.1 Experiment D

To identify the visual features that are most useful in terms of recommendation quality, we have performed an exhaustive set of experiments by feeding a content-based recommender system with all the 31 combinations of the five visual features f1–f5. Features have been extracted from the trailers. We have also combined the low-level stylistic visual features with the genre, resulting in 31 additional combinations. When using two or more low-level features, each feature has been normalized with respect to its maximum value (infinite norm).

Table 3 reports Recall@5 for all the different experimental conditions. The first column of the table describes which combination of low-level features has been used (1 = feature used, 0 = feature not used). The last column of the table reports, as a reference, the recall when using genre only; this value does not depend on the low-level features. The optimal value of K for the KNN similarity has been determined with cross-validation: K = 2 for low-level and hybrid, K = 10 for genre only.
Recommendations based on low-level stylistic visual features extracted from trailers are clearly better, in terms of recall, than recommendations based on genre, for any combination of visual features. However, no considerable difference has been observed between genre-based and hybrid recommendations.

It is worth noting that, when using low-level feature f2 (color variance), recommendations have lower accuracy with respect to the other low-level features, although they remain better than genre-based recommendations. Moreover, when using two or more low-level features together, accuracy does not increase. These results will be further investigated in the next section.
6.4 Feature Analysis

In this section, we wish to investigate why some low-level visual features provide better recommendations than others, as highlighted in the previous section. Moreover, we investigate why combinations of low-level features do not improve accuracy.
6.4.1 Experiment E

In a first experiment, we analyze whether some of the low-level features extracted from trailers are better correlated than the
Table 3 Performance comparison of different CB methods, in terms of the Recall metric (Recall@5), for different combinations of the stylistic visual features

f1 f2 f3 f4 f5   LL stylistic (K=2)   LL+HL hybrid (K=2)   HL genre (K=10)
0  0  0  0  1    0.31                 0.29                 0.21
0  0  0  1  0    0.32                 0.29
0  0  1  0  0    0.31                 0.22
0  1  0  0  0    0.27                 0.23
1  0  0  0  0    0.32                 0.25
0  0  0  1  1    0.32                 0.21
0  0  1  0  1    0.31                 0.22
0  0  1  1  0    0.32                 0.22
0  0  1  1  1    0.32                 0.23
0  1  0  0  1    0.24                 0.20
0  1  0  1  0    0.25                 0.20
0  1  0  1  1    0.25                 0.22
0  1  1  0  0    0.24                 0.20
0  1  1  0  1    0.23                 0.20
0  1  1  1  0    0.25                 0.18
0  1  1  1  1    0.25                 0.22
1  0  0  0  1    0.31                 0.26
1  0  0  1  0    0.31                 0.29
1  0  0  1  1    0.31                 0.18
1  0  1  0  0    0.30                 0.23
1  0  1  0  1    0.30                 0.22
1  0  1  1  0    0.31                 0.24
1  0  1  1  1    0.31                 0.23
1  1  0  0  0    0.25                 0.20
1  1  0  0  1    0.23                 0.20
1  1  0  1  0    0.25                 0.22
1  1  0  1  1    0.25                 0.21
1  1  1  0  0    0.22                 0.20
1  1  1  0  1    0.21                 0.20
1  1  1  1  0    0.25                 0.21
1  1  1  1  1    0.25                 0.21
others with respect to the corresponding features extracted from the full-length movie. This analysis is similar to the one reported in Sect. 6.2, but the results are reported as a function of the features.

Figure 7 plots the cosine similarity values between visual features extracted from the full-length movies and visual features extracted from trailers. Features f2 and f4 (color variance and object motion) are the least similar features, suggesting that their adoption, if extracted from trailers, should provide less accurate recommendations.

We also performed a Wilcoxon test comparing features extracted from the full-length movies and trailers. The results, summarized in Table 4, show that no significant difference exists between features f3 (camera motion) and f5 (light), which indicates that the full-length movies and trailers are highly correlated with respect to these two features. For the other features, significant differences have been obtained. This basically states that some of the extracted features may be either not correlated or not very informative.

Table 4 Significance test with respect to features in the two sets of datasets (movie trailers and full movies)

             f1 (Lsh)   f2 (µcv)   f3 (µm)    f4 (µσ²m)   f5 (µlk)
Wilcox.test  1.3e−9     5.2e−5     0.154      2.2e−16     0.218
H0/H1        w → H1     w → H1     w → H0     w → H1      w → H0
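For reference, the paired Wilcoxon signed-rank test behind Table 4 can be sketched in pure Python with the usual normal approximation for the p-value. This is a textbook sketch, not the exact procedure of R's wilcox.test (which we used for the analysis): zero differences are dropped, tied absolute differences receive averaged ranks, and no continuity or tie correction is applied.

```python
import math

def wilcoxon_signed_rank(x, y):
    # Rank the nonzero |differences|, sum the ranks of the positive
    # differences (W+), and compare W+ against its null mean/variance
    # via a normal approximation to obtain a two-sided p-value.
    d = [a - b for a, b in zip(x, y) if a != b]
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:                      # average ranks over ties in |d|
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for r, di in zip(ranks, d) if di > 0)
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mu) / sigma
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return w_plus, p
```

A consistent shift between the two paired samples yields a small p (reject H0, as for f1, f2, and f4 in Table 4), while symmetric differences yield p close to 1 (retain H0, as for f3 and f5).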
6.4.2 Experiment F

In Fig. 6, scatter plots of all pairwise combinations of the five stylistic visual features (intra-set similarity) are shown. Visual inspection reveals that, overall, the features are weakly correlated. However, some features still present a high degree of mutual linear correlation. For example, features 3 and 4 seem to be highly correlated (see row 3, column 4 in Fig. 6). Moreover, we have observed similarity when comparing the scatter plots of full-length movies with those of trailers: the mutual dependencies between features were similar whether the features were extracted from full-length movies or from trailers. This is another indication that trailers can be considered representative short versions of the full-length movies in terms of stylistic visual features.
6.4.3 Informativeness of the Features

Entropy is an information-theoretic measure [55] that indicates the informativeness of the data. Figure 8 illustrates the entropy scores computed for the stylistic visual features. As can be seen, the entropy scores of almost all stylistic visual features are high. The most informative feature, in terms of entropy score, is the fifth one, i.e., Lighting Key, and the least informative is the second one, i.e., Color Variance (see Sect. 4 for a detailed description). This observation is fully consistent with the other findings we have obtained, e.g., from the Wilcoxon test (see Table 4) and the correlation analysis (see Fig. 7).
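The entropy scores of Fig. 8 can be reproduced in spirit with a small sketch. We assume here that each feature's values are discretized into 10 equal-width bins (matching the "10 classes" in the figure) and that the Shannon entropy is normalized by log2(10), so that 1.0 means a uniform spread over the bins; the exact binning used for the figure may differ.

```python
import math

def normalized_entropy(values, bins=10):
    # Discretize into equal-width bins, then compute the Shannon
    # entropy of the bin frequencies, normalized to [0, 1].
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0      # guard against a constant feature
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    n = len(values)
    h = -sum(c / n * math.log2(c / n) for c in counts if c)
    return h / math.log2(bins)
```

A feature whose values spread evenly across the bins (like the lighting key in Fig. 8) scores near 1, while a feature concentrated in a few bins (like color variance) scores low, i.e., it discriminates poorly between movies.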
Having considered all the results, we remark that our hypotheses have been successfully validated, and we have shown that a proper extraction of the stylistic visual features of videos may lead to higher accuracy of video recommendation than the typical expert-annotation method, whether the features are extracted from full-length videos or from movie trailers only. These are promising results, as they illustrate the possibility of achieving higher accuracy with an automatic method than with a manual method (i.e., expert annotation of videos), since the
Fig. 6 Scatter plots for the different combinations of the stylistic visual features extracted from the movie trailers
Fig. 7 Cosine similarity between stylistic visual features extracted from the full-length movies and their corresponding trailers (x axis: features 1–5; y axis: cosine similarity, 0–1)
manual method can be very costly and in some cases even impossible (e.g., in huge datasets).
7 Discussion

The results presented in the previous section confirm both of our research hypotheses:

1. recommendations based on low-level stylistic visual features are better than recommendations based on high-level semantic features, such as genre;
2. low-level features extracted from trailers are, in general, a good approximation of the corresponding features extracted from the original full-length movies.
Fig. 8 Entropy of the stylistic visual features (x axis: features 1–5; y axis: entropy, 0–1, computed over 10 classes)
7.1 Quality of Recommendations

According to Table 3, all the low-level visual features provide better recommendations than the high-level features (genres). The improvement is particularly evident when using scene duration, light, camera movement, or object movement, with an improvement of almost 50% in terms of recall with respect to genre-based recommendations. The improvement is less strong when using color variance, suggesting that user opinions are not strongly affected by how colors are used in movies. This is partially explained by the limited informative content of the color variance feature, as shown in Fig. 8, where color variance is the feature with the lowest entropy.
The validity of this finding is restricted to the actual experimental conditions considered and may be affected by the
limited size of the dataset. In spite of these limitations, our results provide empirical evidence that

the tested low-level visual features may provide predictive power, comparable to the genre of the movies, in predicting the relevance of movies for users.
Surprisingly, mixing low-level and high-level features does not improve the quality of recommendations and, in most cases, the quality is reduced with respect to using low-level features only, as shown in Table 3. This can be explained by observing that genres can be easily predicted from low-level features. For instance, action movies have shorter scenes and shot lengths than other movies. Therefore, a correlation exists between low-level and high-level features that leads to collinearities and reduces the prediction capabilities of the mixed approach.

When using a combination of two or more low-level features, the quality of recommendations does not increase significantly and, in some cases, decreases, although it is always better than the quality obtained with high-level features. This behavior is not surprising, considering that the low-level features are weakly correlated, as shown in Fig. 6.
7.2 Trailers vs. Movies

One of the potential drawbacks of using low-level visual features is the computational load required to extract the features from full-length movies.

Our research shows that low-level features extracted from movie trailers are strongly correlated with the corresponding features extracted from full-length movies (average cosine similarity 0.78). Scene duration, camera motion, and light are the most similar features when comparing trailers with full-length movies. The result for scene duration is somewhat surprising, as we would expect scenes in trailers to be, on average, shorter than scenes in the corresponding full movies. However, the strong correlation suggests that trailers have consistently shorter shots than full movies. For instance, if an action movie has, on average, shorter scenes than a dramatic movie, the same applies to their trailers.
Our results provide empirical evidence that

low-level visual features extracted from trailers can be used as an alternative to features extracted from full-length movies in building content-based recommender systems.
8 Conclusion and Future Work

In this paper, we have presented a novel content-based method for the video recommendation task. The method extracts and uses low-level visual features from video content to provide users with personalized recommendations, without relying on any high-level semantic features (such as genre, cast, or reviews) that are more costly to collect, because they require an "editorial" effort, and are not available in many new-item scenarios.

We have developed a main research hypothesis, i.e., that a proper extraction of low-level visual features from videos may lead to higher accuracy of video recommendations than the typical expert-annotation method. Based on a large number of experiments, we have successfully verified this hypothesis, showing that the recommendation accuracy is higher when using the considered low-level visual features than when high-level genre data are employed.

The findings of our study do not diminish the importance of explicit semantic features (such as genre, cast, director, tags) in content-based recommender systems. Still, our results provide a powerful argument for exploring more systematically the role of low-level features automatically extracted from video content and for exploiting them.
Our future work can be extended in a number of challenging directions:

• We will widen our analysis by adopting bigger and different datasets, to provide more robust statistical support to our findings.
• We will investigate the impact of using different content-based recommendation algorithms, such as those based on Latent Semantic Analysis, when adopting low-level features.
• We will extend the range of visual features extracted and will also include audio features.
• We will analyze recommender systems based on low-level features not only in terms of accuracy, but also in terms of perceived novelty and diversity, with a set of online user studies.
Acknowledgments This work is supported by Telecom Italia S.p.A., Open Innovation Department, Joint Open Lab S-Cube, Milan.
References

1. Ricci F, Rokach L, Shapira B (2011) Introduction to recommender systems handbook. In: Ricci F, Rokach L, Shapira B, Kantor PB (eds) Recommender systems handbook. Springer, pp 1–35
2. Adomavicius G, Tuzhilin A (2005) Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. IEEE Trans Knowl Data Eng 17(6):734–749
3. Burke R (2002) Hybrid recommender systems: survey and experiments. User Model User Adapt Interact 12(4):331–370
4. Su X, Khoshgoftaar TM (2009) A survey of collaborative filtering techniques. Adv Artif Intell 2009:4
5. Balabanovic M, Shoham Y (1997) Fab: content-based, collaborative recommendation. Commun ACM 40(3):66–72
6. Lops P, De Gemmis M, Semeraro G (2011) Content-based recommender systems: state of the art and trends. In: Recommender systems handbook. Springer, pp 73–105
7. Pazzani MJ, Billsus D (2007) Content-based recommendation systems. In: The adaptive web. Springer-Verlag, Berlin, Heidelberg, pp 325–341. http://dl.acm.org/citation.cfm?id=1768197.1768209
8. Zettl H (2002) Essentials of applied media aesthetics. In: Dorai C, Venkatesh S (eds) Media computing, The Springer International Series in Video Computing, vol 4. Springer, New York, pp 11–38
9. Cremonesi P, Koren Y, Turrin R (2010) Performance of recommender algorithms on top-n recommendation tasks. In: Proceedings of the 2010 ACM Conference on Recommender Systems, RecSys 2010, Barcelona, Spain, September 26–30, 2010, pp 39–46
10. Deshpande M, Karypis G (2004) Item-based top-n recommendation algorithms. ACM Trans Inf Syst (TOIS) 22(1):143–177
11. Yang B, Mei T, Hua XS, Yang L, Yang SQ, Li M (2007) Online video recommendation based on multimodal fusion and relevance feedback. In: Proceedings of the 6th ACM international conference on image and video retrieval. ACM, pp 73–80
12. Zhao X, Li G, Wang M, Yuan J, Zha ZJ, Li Z, Chua TS (2011) Integrating rich information for video recommendation with multi-task rank aggregation. In: Proceedings of the 19th ACM international conference on multimedia. ACM, pp 1521–1524
13. Elahi M, Ricci F, Rubens N (2013) Active learning strategies for rating elicitation in collaborative filtering: a system-wide perspective. ACM Trans Intell Syst Technol (TIST) 5(1):13
14. Billsus D, Pazzani MJ (1999) A hybrid user model for news story classification. Springer, New York
15. Kelly D, Teevan J (2003) Implicit feedback for inferring user preference: a bibliography. In: ACM SIGIR Forum, vol 37. ACM, pp 18–28
16. Musto C, Narducci F, Lops P, Semeraro G, de Gemmis M, Barbieri M, Korst J, Pronk V, Clout R (2012) Enhanced semantic tv-show representation for personalized electronic program guides. In: User modeling, adaptation, and personalization. Springer, pp 188–199
17. Degemmis M, Lops P, Semeraro G (2007) A content-collaborative recommender that exploits wordnet-based user profiles for neighborhood formation. User Model User Adapt Interact 17(3):217–255
18. Eirinaki M, Vazirgiannis M, Varlamis I (2003) Sewep: using site semantics and a taxonomy to enhance the web personalization process. In: Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 99–108
19. Magnini B, Strapparava C (2001) Improving user modelling with content-based techniques. In: User modeling 2001. Springer, pp 74–83
20. Mooney RJ, Roy L (2000) Content-based book recommending using learning for text categorization. In: Proceedings of the fifth ACM conference on digital libraries. ACM, pp 195–204
21. Ahn JW, Brusilovsky P, Grady J, He D, Syn SY (2007) Open user profiles for adaptive news systems: help or harm? In: Proceedings of the 16th international conference on World Wide Web. ACM, pp 11–20
22. Billsus D, Pazzani MJ (2000) User modeling for adaptive news access. User Model User Adapt Interact 10(2–3):147–180
23. Cantador I, Szomszor M, Alani H, Fernández M, Castells P (2008) Enriching ontological user profiles with tagging history for multi-domain recommendations
24. Middleton SE, Shadbolt NR, De Roure DC (2004) Ontological user profiling in recommender systems. ACM Trans Inf Syst (TOIS) 22(1):54–88
25. Deldjoo Y, Elahi M, Quadrana M, Cremonesi P (2015) Toward building a content-based video recommendation system based on low-level features. In: E-commerce and web technologies. Springer
26. Deldjoo Y, Elahi M, Quadrana M, Cremonesi P, Garzotto F (2015) Toward effective movie recommendations based on mise-en-scène film styles. In: Proceedings of the 11th Biannual Conference on Italian SIGCHI Chapter. ACM, pp 162–165
27. Bogdanov D, Herrera P (2011) How much metadata do we need in music recommendation? A subjective evaluation using preference sets. In: ISMIR, pp 97–102
28. Bogdanov D, Serrà J, Wack N, Herrera P, Serra X (2011) Unifying low-level and high-level music similarity measures. IEEE Trans Multimed 13(4):687–701
29. Knees P, Pohle T, Schedl M, Widmer G (2007) A music search engine built upon audio-based and web-based similarity measures. In: Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval. ACM, pp 447–454
30. Seyerlehner K, Schedl M, Pohle T, Knees P (2010) Using block-level features for genre classification, tag classification and music similarity estimation. Submission to Audio Music Similarity and Retrieval Task of MIREX 2010
31. Canini L, Benini S, Leonardi R (2013) Affective recommendation of movies based on selected connotative features. IEEE Trans Circuits Syst Video Technol 23(4):636–647
32. Lehinevych T, Kokkinis-Ntrenis N, Siantikos G, Dogruöz AS, Giannakopoulos T, Konstantopoulos S (2014) Discovering similarities for content-based recommendation and browsing in multimedia collections. In: IEEE 2014 tenth international conference on signal-image technology and internet-based systems (SITIS), pp 237–243
33. Rubens N, Elahi M, Sugiyama M, Kaplan D (2015) Active learning in recommender systems. In: Recommender systems handbook.
Springer, pp 809–846 1032
34. Davidson J, Liebald B, Liu J, Nandy P, Van Vleet T, Gargi U, 1033
Gupta S, He Y, Lambert M, Livingston B et al (2010) The youtube 1034
video recommendation system. In: Proceedings of the fourth ACM 1035
conference on Recommender systems ACM, pp 293–296 1036
35. Wang Y, Xing C, Zhou L (2006) Video semantic models: survey 1037
and evaluation. Int J Comput Sci Netw Secur 6:10–20 1038
36. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual 1039
content-based video indexing and retrieval. IEEE Trans Syst Man 1040
Cybern Part C Appl Rev 41(6):797–819 1041
37. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multi- 1042
media information retrieval: State of the art and challenges. ACM 1043
Trans Multimed Comput Commun Appl (TOMM) 2(1):1–19 1044
38. Rasheed Z, Sheikh Y, Shah M (2005) On the use of computable 1045
features for film classification. IEEE Trans Circuits Syst Video 1046
Technol 15(1):52–64 1047
39. Brezeale D, Cook DJ (2008) Automatic video classification: a sur- 1048
vey of the literature. IEEE Trans Syst Man Cybern Part C Appl 1049
Rev 38(3):416–430 1050
40. Rasheed Z, Shah M (2003) Video categorization using semantics 1051
and semiotics. In: Video mining. Springer, pp 185–217 1052
41. Zhou H, Hermans T, Karandikar AV, Rehg JM (2010) Movie genre 1053
classification via scene categorization. In: Proceedings of the inter- 1054
national conference on Multimedia ACM, pp 747–750 1055
42. Dorai C, Venkatesh S (2001) Computational media aesthetics: 1056
Finding meaning beautiful. IEEE Multimed 8(4):10–12 1057
43. Buckland W (2008) What does the statistical style analysis of film 1058
involve? A review of moving into pictures. more on film history, 1059
style, and analysis. Lit Linguist Comput 23(2):219–230 1060
44. Valdez P, Mehrabian A (1994) Effects of color on emotions. Journal 1061
of Experimental Psychology: General 123(4):394 1062
45. Wang HL, Cheong LF (2006) Affective understanding in film. IEEE 1063
Trans Circuits Syst Video Technol 16(6):689–704 1064
Journal: 13740 MS: 0060 TYPESET DISK LE CP Disp.:2016/2/5 Pages: 15 Layout: Large
46. Choros K (2009) Video shot selection and content-based scene detection for automatic classification of TV sports news. In: Tkacz E, Kapczynski A (eds) Internet Technical Development and Applications. Advances in Intelligent and Soft Computing, vol 64. Springer, Berlin Heidelberg, pp 73–80
47. Barron JL, Fleet DJ, Beauchemin SS (1994) Performance of optical flow techniques. Int J Comput Vis 12(1):43–77
48. Horn BK, Schunck BG (1981) Determining optical flow. In: 1981 Technical Symposium East. International Society for Optics and Photonics, pp 319–331
49. Tkalcic M, Tasic JF (2003) Colour spaces: perceptual, historical and applicational background. In: EUROCON 2003. Computer as a Tool. The IEEE Region 8, vol 1, pp 304–308
50. Datasets – GroupLens. http://grouplens.org/datasets/. Accessed: 2015-05-01
51. YouTube. http://www.youtube.com. Accessed: 2015-04-01
52. Kohavi R (1995) The power of decision tables. In: 8th European Conference on Machine Learning. Springer, pp 174–189
53. Kohavi R, Sommerfield D (1998) Targeting business users with decision table classifiers. In: KDD, pp 249–253
54. Freitas AA (2014) Comprehensible classification models: a position paper. ACM SIGKDD Explor Newsl 15(1):1–10
55. Guyon I, Matic N, Vapnik V et al (1996) Discovering informative patterns and data cleaning