A compositional approach increases the level of representation that can be automatically extracted and used in a visual information retrieval system. Visual information at the perceptual level is aggregated according to a set of rules. These rules reflect the specific context and transform perceptual words into phrases capturing pictorial content at a higher, and closer to the human, semantic level.
Visual information retrieval systems have entered a new era. First-generation systems allowed access to images and videos through textual data.1,2 Typical searches for these systems include, for example, "all images of paintings of the Florentine school of the 15th century" or "all images by Cezanne with landscapes." Such systems expressed information through alphanumeric keywords or scripts. They employed representation schemes like relational models, frame models, and object-oriented models. Current-generation retrieval systems, on the other hand, support full retrieval by visual content.3,4 Access to visual information is performed not only at a conceptual level, using keywords as in the textual domain, but also at a perceptual level, using objective measurements of visual content. In these systems, image processing, pattern recognition, and computer vision constitute an integral part of the system's architecture and operation. They objectively analyze pixel distribution and extract the content descriptors automatically from raw sensory data. Image content descriptors are commonly represented as feature vectors, whose elements correspond to significant parameters that model image attributes. Therefore, visual attributes are regarded as points in a multidimensional feature space, where point closeness reflects feature similarity.
These advances (for comprehensive reviews of the field, see the "Further Reading" sidebar) have paved the way for third-generation systems, featuring full multimedia data management and networking support. Forthcoming standards such as MPEG-4 and MPEG-7 (see the Nack and Lindsay article in this issue) provide the framework for efficient representation, processing, and retrieval of visual information.
Yet many problems must still be addressed and solved before these technologies can emerge. An important issue is the design of indexing structures for efficient retrieval from large, possibly distributed, multimedia data repositories. To achieve this goal, image and video content descriptors can be internally organized and accessed through multidimensional index structures.5 A second key problem is bridging the semantic gap between the system and users, that is, devising representations that capture visual content at the high semantic levels especially relevant for retrieval tasks. Specifically, automatically obtaining a representation of high-level visual content remains an open issue. Virtually all the systems for automatic storage and retrieval of visual information proposed so far use low-level perceptual representations of pictorial data, which have limited semantics.
Building up a representation proves tantamount to defining a model of the world, possibly through a formal description language, whose semantics capture only a few significant aspects of the information content.6
Different languages and semantics induce diverse world representations. In text, for example, the meaning of single words is specific yet limited, and an aggregate of several words, a phrase, produces a higher degree of significance and expressivity. Hence, the rules for the syntactic composition of signs in a given language also generate a new world representation, offering richer semantics in the hierarchy of signification.
To avoid equivocation, a retrieval system should embed a semantic level reflecting as much as possible the one humans refer to during interrogation. The most common way to enrich a visual information retrieval system's semantics is to annotate pictorial information manually at storage time through a set of external keywords describing the pictorial content. Unfortunately, textual annotation has several problems:
1. It's too expensive to go through manual annotation with large databases.
2. Annotation is subjective (generally, the annotator and the user are different persons).

3. Keywords typically don't support retrieval by similarity.

Semantics in Visual Information Retrieval
Carlo Colombo, Alberto Del Bimbo, and Pietro Pala
University of Florence, Italy

Feature Article
1070-986X/99/$10.00 © 1999 IEEE
Automatically increasing the semantic level of representation provides an alternative. Starting from perceptual features, the atomic elements of visual information, some intermediate semantic levels can be extracted using a suitable set of rules. Perceptual features represent the evidence upon which to build the interpretation of visual data. A process of syntactic construction called compositional semantics builds the semantic representation.
In this article, we discuss how to extract automatically, from raw image and video data, two distinct semantic levels and how to represent these levels through appropriate language rules. As we organize semantic levels according to a signification hierarchy, the corresponding description languages become stratified, allowing the composition of higher semantic levels according to syntactic rules that combine perceptual features and lower level signs. Since these languages directly depend on objective features, the approach naturally accommodates visual search by example and retrieval by similarity. We'll address two different visual contexts: art paintings and commercial videos. We'll also present retrieval examples showing that compositional semantics improves accordance with human judgment and expectation.
A language-oriented approach
Here we discuss the compositional semantics framework we developed. We also provide background theories in art and advertising.
Compositional semantics framework
The compositional semantics framework involves a bottom-up analysis and processing of a visual message, starting from its perceptual features. For still images, these features are image colors and edges. For videos or image sequences, additional features include the presence of editing effects, the motion of objects within a scene, and so on. Without loss of generality, the perceptual properties of a visual message can be represented through a set of scores P = {ϕi}, i = 1, …, n, each score ϕi ∈ [0, 1] representing the extent to which the i-th feature appears in the message.
We devised two distinct levels of the signification hierarchy, namely the expressive and the emotional levels, as plausible intermediate steps involved in the construction of meaning.
Semantic levels: expression and emotion. At the expressive level, perceptual data are organized into a group of new features, the expressive features, which take into account both spatial and temporal distributions of perceptual features. Expressive features reflect concepts that humans embody at a higher level of abstraction to achieve a more compact visual representation.
Combination rules are modeled as functions F acting over the perceptual feature set P and returning a score expressing the degree of truth by which the rule F holds. Hence, a rule Fj can be defined as
Fj : [0, 1]^n → [0, 1]
Operators of logical composition between rules extend the signification of the representation. We define these operators as
F1 ∧ F2 = min(F1, F2)
F1 ∨ F2 = max(F1, F2)
The expressive feature set F = {F1, …, Fm} qualifies the content of the visual message at the expressive level.
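These definitions can be sketched in a few lines of code. The snippet below is our illustration, not the authors' implementation; the rule names and feature scores are invented for the example:

```python
# Fuzzy composition operators from the text: AND is min, OR is max.
def f_and(a, b):
    """F1 ∧ F2 = min(F1, F2)."""
    return min(a, b)

def f_or(a, b):
    """F1 ∨ F2 = max(F1, F2)."""
    return max(a, b)

# Perceptual feature scores P = {phi_i}, each in [0, 1] (made-up values).
P = {"warm_colors": 0.8, "slanted_lines": 0.6, "saturated": 0.3}

# A rule F_j : [0, 1]^n -> [0, 1]; both rules below are hypothetical.
def F_dynamism(p):
    """High when the image is both warm and full of slanted lines."""
    return f_and(p["warm_colors"], p["slanted_lines"])

def F_softness(p):
    """High when colors are unsaturated or lines are not slanted."""
    return f_or(1.0 - p["saturated"], 1.0 - p["slanted_lines"])

# Rules compose into higher-level expressive features.
score = f_and(F_dynamism(P), F_softness(P))
```

Because every rule returns a value in [0, 1], composed rules stay in [0, 1] and can themselves be composed further, which is what lets the representation stratify into higher semantic levels.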
A musical example lets us capture the distinction between the expressive and emotional levels. Assume that the laws of harmony and counterpoint characterize music composition at the expressive level. Following these rules, however, doesn't guarantee a pleasant result for the audience. In other words, musical fruition and understanding involve adopting aesthetic criteria that go beyond expression (the syntax of meaning) and reach emotion as the ultimate semantics of meaning. (The art of J.S. Bach provides a remarkable example of how to reach musical beauty by a skillful adherence to the formal rules of 18th-century music.)

July–September 1999

Further Reading
For a comprehensive introduction to visual information retrieval, see

A. Del Bimbo, Visual Information Retrieval, Academic Press, London, 1999.

P. Aigrain, H. Zhang, and D. Petkovic, "Content-Based Representation and Retrieval of Visual Media: A State-of-the-Art Review," Multimedia Tools and Applications, Vol. 3, No. 4, Dec. 1996, pp. 179-202.

A. Gupta and R. Jain, "Visual Information Retrieval," Comm. of the ACM, Vol. 40, No. 5, May 1997, pp. 71-79.

A review of the state of the art in visual information processing can be found in

B. Furht, S.W. Smoliar, and H.J. Zhang, Video and Image Processing in Multimedia Systems, Kluwer Academic Publishers, Boston, 1996.

We formally construct the emotional level, at the top of our signification hierarchy, from the levels below, namely the expressive and perceptual levels. With a notation similar to the one above, rules at the emotional level are represented through functions G acting over the set of perceptual and expressive features P × F and returning a score that expresses the degree of truth by which the rule G holds. Hence, a rule Gk can be defined as
Gk : [0, 1]^(n+m) → [0, 1]
Operators of logical composition between rules can extend the representation's semantics. The set G = {Gk} qualifies the content of the visual message at the emotional level.
Perceptual, expressive, and emotional features qualify the meaning of a visual message at different levels of signification. For all these levels, construction rules depend on the specific data domain to which they refer (such as movies, commercials, and TV news for videos; paintings, photographs, and trademarks for still images). Specifically, the expressive level features objective properties that generally depend on collective cultural backgrounds. The emotional level, on the contrary, relies on subjective elements such as individual cultural background and psychological moods.
Background theories
Here we present theories that provide a reference framework for developing expressive and emotional rules in the domains of art images and commercial videos.

Expression in art images. Among the many authors who recently addressed the psychology of art images, Arnheim discussed the relationships between artistic form and perceptive processes,7
and Itten8 formulated a theory about the use of color in art and about the semantics it induces. Itten observed that color combinations induce effects such as harmony, disharmony, calmness, and excitement that artists consciously exploit in their paintings. Most of these effects relate to high-level chromatic patterns rather than to physical properties of single points of color (see, for example, Figure 1a). Such rules describe art paintings at the expressive level. The theory characterizes colors according to the categories of hue, luminance, and saturation. Twelve fundamental hues are chosen, and each of them is varied through five levels of luminance and three levels of saturation. These colors are arranged in a chromatic sphere called the Itten sphere (Figure 1b), such that perceptually contrasting colors have opposite coordinates with respect to the center of the sphere.

IEEE MultiMedia

Figure 1. (a) Contrasts and colors in an art image. (b) The Itten sphere, shown in external views, equatorial and longitudinal sections, and geographic coordinates (θ, φ, ρ); the pure color circle holds the twelve fundamental hues (yellow, orange-yellow, orange, orange-red, red, purple-red, purple, purple-blue, blue, green-blue, green, green-yellow). (c) The polygons generating three- and four-chromatic accordance.
Analyzing the polar reference system, we can identify four different types of color contrast: pure colors, light-dark, warm-cold, and saturated-unsaturated (quality). Psychological studies have suggested that, in Western culture, red-orange environments induce a sense of warmth (yellow through red-purple are warm colors). Conversely, green-blue conveys a sensation of cold (yellow-green through purple are cold colors). Cold sensations can be emphasized by contrast with a warm color or dampened by coupling with a highly cold tint. The concept of harmonic accordance combines hues and tones to generate a stability effect on the human eye. Harmony can be achieved by creating color accordances, generated on the sphere by connecting locations through regular polygons (see Figure 1c).
Emotion in art images. Color combinations also induce emotional effects. The mapping of low-level color primitives into emotions is quite complex. It must consider theories about the use of colors and cognitive models, and involve cultural and anthropological backgrounds. In other words, people with different cultures will likely share the same expressive-level representation but not perceive the same emotions in front of the same color pattern.
Artists use color combinations unconsciously and consciously to produce optical and psychological sensations.9 Warm colors attract the eye and grab the observer's attention more than cold colors. Cold sensations provided by a large region of cold color can be emphasized by contrast with a warm color or dampened by coupling with a highly cold tint. Similarly, small, close cold regions emphasize large, warm regions. Red communicates happiness, dynamism, and power. Orange, the warmest color, resembles the color of fire and thus communicates glory. Green communicates calmness and relaxation, and is the color of hope. Blue, a cold color, improves the dynamism of warm colors, suggesting gentleness, fairness, faithfulness, and virtue. Purple, a melancholy color, sometimes communicates fear. Brown generally serves as the background color for relaxing scenes. Differences in lightness determine a sense of plasticity and the perception of different planes of depth. The absence of contrasting hues and the presence of a single dominant color region inspire a sense of uneasiness, strengthened by the presence of dark yellow and purple colors. The presence of regions in harmonic accordance communicates a sense of stability and joy. Calmness and quiet can be conveyed by combining complementary colors. In the presence of two noncomplementary colors, the human eye looks for the complementary of the observed color, which rouses a sense of anguish.
Another basic contribution to the semantics of an image relates to the presence and characteristics of notable lines. In fact, line slope is a key feature through which the artist communicates different emotions. For example, an oblique slope communicates dynamism and action, while a flat slope, such as a horizon, communicates calmness and relaxation. Figure 2 shows an example, illustrating how saturated colors can combine with sloped lines to communicate a sensation of dynamism in the observer.
Figure 2. Some frames taken from a "Sergio Tacchini" commercial video. The use of colors and lines in each frame communicates dynamism. In this video, the effect is further enhanced by a high frequency of shots. (Courtesy of Mediaset.)

Expression in commercials. Semiotics studies the meaning of symbols and signs. In semiotics, a sign represents anything that conveys meaning according to commonly accepted conventions. Semiotics therefore suggests that signs relate to their meaning according to the specific cultural background. Semioticians usually identify two distinct steps for the production of meaning:10
1. an abstract level formed by narrative structures, that is, structures including all those basic signs that create meaning and those values determined by sign combinations, and

2. a concrete level formed by discourse structures, describing the way in which the author uses narrative elements to create a story.
Representing the expressive and emotional levels for commercial videos inherits some of the features introduced before for still images and adds new features related to the way frames are concatenated. Semiotics classifies advertising images, and specifically commercials, into four different categories (noted below) that relate to the narrative element.11 Directors will use the main narrative signs of the video (camera breaks, colors, editing effects, rhythm, shot angles, and lines) in a peculiar way, depending on the specific video typology considered.
❚ Practical commercials emphasize the qualities of a product according to a common set of values. The advertiser describes the product in a familiar environment so that the audience naturally perceives it as useful. Camera takes are usually frontal, and transitions take place in a smooth and natural way. This implies choosing long dissolves for merging shots, and a prevalence of horizontal and vertical lines, giving an impression of relaxation and solidity.
❚ Critical commercials introduce a hierarchy of reference values. The product serves as the subject of the story, which focuses on the product's qualities through an apparently objective description of its features. The scene has to appear more realistic than reality itself. For this reason, the commercial has a minimum number of camera breaks. Due to smooth camera motions, the ever-changing colors in the background draw the audience's attention to the (constant) color of the product.
❚ Utopic commercials provide evidence that the product can succeed in critical tests. Here, the story doesn't follow a realistic plot. Rather, situations appear as in a dream. Mythical scenarios are chosen to present the product, shown to succeed in critical conditions, often in a fantastic and unrealistic way. The director creates a movie-like atmosphere, with a set of dominant colors defining a closed chromatic world and with all the traditional editing effects (cuts, dissolves) possibly taking place.
❚ Playful commercials emphasize the accordance between users' needs and product qualities. Here a manifest parody of the other typologies of ads takes place. The commercial clearly states to the audience that they're watching advertising material. Situations and places visibly differ from everyday life, and they're deformed in such a caricatural and grotesque fashion that the agreement between product qualities and purchaser's needs is often shown in an ironic way (such as an old woman driving a Ferrari). The director emphasizes the presence of the camera in the ad and uses all possible effects to stimulate the active participation of the audience. Also, everything looks strange and "false": unnatural colors, improbable camera takes, and so on.
Emotion in commercials. The choice of a given combination of narrative signs affects both the expressive and emotional levels of signification. While dissolves communicate calmness and relaxation, cuts increase the video's dynamism (see again Figure 2). Sometimes in commercials, directors emphasize cuts by including a white frame between the end of one shot and the beginning of the next, thus inducing a sense of shock in the observer. The semantics associated with editing effects thus relates to the use of cuts or dissolves and to the length of the shots in the video. If the video consists of a few long shots, the effect is one of calmness and relaxation. Alternatively, videos with many short shots separated by cuts induce a sense of action and dynamism in the audience.
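As a rough illustration of this editing-rhythm rule, one might score a video's dynamism from its shot lengths and the fraction of cut transitions. This is our sketch only; the reference shot length and the squashing function are assumptions, not parameters from the article:

```python
# Illustrative sketch: many short, cut-separated shots push the score
# toward 1 (action/dynamism); a few long, dissolve-joined shots push it
# toward 0 (calmness). Thresholds below are invented for the example.

def dynamism_score(shot_lengths_s, cut_fraction):
    """shot_lengths_s: shot durations in seconds;
    cut_fraction: fraction of transitions that are cuts (vs dissolves)."""
    if not shot_lengths_s:
        return 0.0
    avg = sum(shot_lengths_s) / len(shot_lengths_s)
    # Short average shots yield a rhythm score near 1; long ones near 0.
    rhythm = 1.0 / (1.0 + avg / 2.0)  # 2 s reference length (assumed)
    return rhythm * cut_fraction

fast = dynamism_score([0.5, 0.8, 0.6, 0.4], cut_fraction=1.0)  # spot ad
slow = dynamism_score([12.0, 15.0], cut_fraction=0.0)          # two dissolves
```

With these made-up inputs, the rapid-fire, cut-heavy video scores high and the two-shot, dissolve-joined video scores zero, matching the qualitative rule in the text.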
Content representation and semantics-based retrieval
In the framework of our research, we used the theories presented in the previous section to qualify the content of a visual message at an intermediate semantic level. Specifically, we've exploited Itten's theory and semiotic principles to represent the visual content at the expressive level. Our work supports defining formal rules to qualify the effects of color, geometric, and dynamic feature combinations. We've also exploited psychological principles of visual communication to represent the visual content at the emotional level. In the next section, we use the semantic descriptors automatically extracted from art images and commercial videos for retrieval purposes.
Content representation for art images
Here we discuss the rules needed to represent the content of art images.
Expressive level. Exploiting Itten's model to qualify an image's chromatic content at an expressive level requires
1. segmenting the image into regions characterized by uniform colors, and

2. representing chromatic and spatial features of image regions.
We segment images by looking for clusters in the color space, then back-projecting cluster centroids in the feature space onto the image. Segmentation occurs through selecting the appropriate color space so that small feature distances correspond to similar colors in the perceptual domain. Adopting the International Commission on Illumination (Commission Internationale de l'Eclairage, CIE) L*u*v* space12 accomplishes this.
Once segmented into regions, the image can be described in terms of intra-region and inter-region properties. Intra-region properties include region color, warmth, hue, luminance, saturation, position, and size. Inter-region properties consist of hue, saturation, warmth, and luminance contrast, and harmony. To manage the vagueness of chromatic properties, we use a fuzzy representation model to describe a generic property's value. We describe a generic property by introducing n reference values for that property and then considering n scores (ϕ1, …, ϕn), where the i-th score measures the extent to which the region conforms to the i-th reference value. For instance, if we introduce three reference values for the luminance of a region (corresponding to dark, medium, and bright), the descriptor (0.0, 0.1, 0.9) represents a very bright region's luminance.
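The luminance example can be reproduced with a small sketch. The triangular membership functions below are our assumption; the article does not specify the membership shape:

```python
# Fuzzy descriptor sketch: score a region property against n reference
# values. Here, luminance in [0, 1] against dark / medium / bright.

def triangular(x, left, peak, right):
    """Triangular membership: 1 at peak, 0 at or outside [left, right]."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

def luminance_descriptor(lum):
    """Scores (dark, medium, bright) for a region luminance in [0, 1]."""
    dark = triangular(lum, -0.5, 0.0, 0.5)
    medium = triangular(lum, 0.0, 0.5, 1.0)
    bright = triangular(lum, 0.5, 1.0, 1.5)
    return (dark, medium, bright)

d = luminance_descriptor(0.95)  # approximately (0.0, 0.1, 0.9)
```

With luminance 0.95 this yields roughly the (0.0, 0.1, 0.9) "very bright" descriptor used as the example in the text.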
According to Itten's rules, these abstract properties must be translated into language sentences. In formal language theory notation, these sentences are represented through region formulas (φ) that characterize the chromatic and arrangement properties of color patches:
φ := region | hue = λh | lum = λl | sat = λs | warmth = λw | size = λS | position = λp | Contrastγ(φ1, φ2) | Harmony(φ1, …, φn) | φ1 ∧ φ2 | φ1 ∨ φ2
where λγ is a feasible value for the measure γ, with γ ∈ {h, l, s, w} denoting the attributes of hue, luminance, saturation, and warmth, respectively. We define semantic clauses in terms of a fulfillment relation |= of a generic formula φ on a region R. The degree of truth by which φ is verified over R is expressed by a value ξ computed from the fuzzy descriptors of region properties. We code semantic clauses into a model-checking engine. Given a generic formula φ and an image I, the engine computes a score representing the degree of truth by which φ is verified over I (see the sidebar "Model-Checking Engine").13
Emotional level. The psychological analysis of effects induced by images suggests that we can use only a limited subset of primary emotions to express the great variability of human emotions (secondary emotions). Explicitly defining the rules mapping expressive and perceptual features onto emotions would require us to build an overcomplicated model taking into account cultural and behavioral issues. Hence, we prefer to determine the relative relevance of each single feature through an adaptation process that fits a specific culture and fashion.
Explicitly, the definition of the rules mapping expressive and perceptual features onto emotions involves
1. Identifying a set of primary emotions. We identified four primary emotions, namely action, relaxation, joy, and uneasiness, as the most relevant to express the interaction of humans with images. Referring to the general scheme presented above, secondary emotions such as fear and aggressiveness can be expressed as combinations of action and uneasiness.
2. Identifying for each primary emotion a set of plausible inputs. In our model, we expect contrasts of warmth and contrasts of hue to generate a sense of action and dynamism. The presence of lines with a high slope reinforces this.
3. Defining a training set for the model. Given a database of images, we use some images as templates of primary emotions. This training set adapts the weights of perceptual features to determine the relative relevance of each feature for the emotion induced in subjects of a certain cultural context.
We summarize the dependence between perceptual/expressive features and emotions below. We measure the degree of action communicated by an image as

GIaction = gI1(Fwarmth−c, Fhue−c, ϕslanted, w1a, w2a, w3a)

where Fwarmth−c and Fhue−c represent the expressive features measuring the presence of warmth and hue contrasts, and ϕslanted the perceptual feature measuring the presence of slanted lines (the wi's are the feature weights). We can measure the degree of relaxation an image communicates by

GIrelax = gI2(Flum−c, ϕbrown, ϕgreen, w1r, w2r, w3r)

where Flum−c represents the expressive feature measuring the presence of luminance contrasts, and ϕbrown and ϕgreen the perceptual features measuring the presence of brown and green regions. The degree of joy communicated by an image can be measured as

GIjoy = gI3(Fharmony, w1j)

where Fharmony represents the expressive feature measuring the presence of regions in harmonic accordance. Finally, we measure the degree of uneasiness an image communicates as
Model-Checking Engine
A model-checking engine decides the satisfaction of a formula φ over an image I with a two-step process. First, the engine recursively decomposes formula φ into subformulas in a top-down manner. This allows representing φ with a tree in which leaf nodes correspond to intra-region specifications. Afterwards, the engine labels regions in the image description with the subformulas they satisfy, in a bottom-up manner. The engine decides the satisfaction of region formulas by directly referring to the chromatic fuzzy descriptors. This first labeling level is then exploited to decide the satisfaction of composition formulas. In this labeling process, the engine first checks whether a region contains pixels that can satisfy the subformula. If it does, the engine labels the region with the subformula and with the degree of truth by which the subformula is satisfied. When the degree of truth falls under a given minimum threshold, the engine drops the image from the candidate list.
In Figure A, the engine computes the degree by which the formula φ = Contrastw(hue = orange, size = large) holds over an image that has already been segmented. For simplicity, the engine considers only four regions, namely R1, R2, R3, and R4. First, φ decomposes into subformulas:

φ1 : hue = orange
φ2 : size = large
φ3 : Contrastw(φ1, φ2)

Then, according to a bottom-up approach, the engine labels regions with the subformulas they satisfy:
Step 1 Label R2 and R4 with φ1
Step 2 Label R1 with φ2
Step 3 Label R1, R2, and R4 with φ3
At the end of this process, the engine has labeled three regions with φ3; thus, the original formula φ is satisfied over the image.
Figure A. The model-checking engine. From top to bottom: decomposition of the formula φ = Contrastw(hue = orange, size = large) into subformulas φ1, φ2, and φ3; the segmented image with regions R1 through R4; and the original image.
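The sidebar's example can be mimicked with a toy bottom-up labeler. The region scores and the warmth-contrast combination rule below are made-up illustrations, not the authors' actual engine:

```python
# Fuzzy degrees to which each region satisfies the leaf formulas
# (invented values chosen to reproduce the sidebar's labeling steps).
regions = {
    "R1": {"hue=orange": 0.1, "size=large": 0.9},
    "R2": {"hue=orange": 0.8, "size=large": 0.2},
    "R3": {"hue=orange": 0.0, "size=large": 0.3},
    "R4": {"hue=orange": 0.7, "size=large": 0.1},
}

THRESHOLD = 0.5  # minimum degree of truth to keep a label

def label_regions(leaf):
    """Step 1: label regions satisfying a leaf formula above threshold."""
    return {r: s[leaf] for r, s in regions.items() if s[leaf] >= THRESHOLD}

def contrast(phi1_labels, phi2_labels):
    """Step 2: label every region that takes part in a contrasting pair
    (one region per subformula); the pair's degree is the min of the two."""
    out = {}
    for r1, d1 in phi1_labels.items():
        for r2, d2 in phi2_labels.items():
            if r1 != r2:
                for r in (r1, r2):
                    out[r] = max(out.get(r, 0.0), min(d1, d2))
    return out

phi1 = label_regions("hue=orange")  # labels R2 and R4
phi2 = label_regions("size=large")  # labels R1
phi3 = contrast(phi1, phi2)         # labels R1, R2, and R4
satisfied = max(phi3.values()) >= THRESHOLD
```

With these scores the labeler reproduces the sidebar's three steps: φ1 on R2 and R4, φ2 on R1, and φ3 on R1, R2, and R4, so φ is satisfied over the image.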
GIuneas = gI4(Fn−hue−c, ϕyellow, ϕpurple, w1u, w2u, w3u)

where Fn−hue−c represents the expressive feature measuring the absence of contrasts of hue, and ϕyellow and ϕpurple the perceptual features measuring the presence of yellow and purple regions.

Presently we use a linear mapping to model the gIi functions and a linear regression scheme to achieve adaptation of the weights wi based on a set of training examples.
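The linear mapping plus least-squares weight adaptation can be sketched as follows. This is a toy with two features and invented training data; the actual system fits the gIi functions over the full perceptual and expressive feature set:

```python
# Sketch: an emotion score g(x) is a linear map w·x over feature scores,
# with weights fit by least squares against training images labeled with
# the target emotion degree (hand-rolled 2x2 normal equations).

def fit_weights(X, y):
    """Least squares for two features: solve (X^T X) w = X^T y."""
    a = sum(x[0] * x[0] for x in X)
    b = sum(x[0] * x[1] for x in X)
    d = sum(x[1] * x[1] for x in X)
    p = sum(x[0] * t for x, t in zip(X, y))
    q = sum(x[1] * t for x, t in zip(X, y))
    det = a * d - b * b
    return ((d * p - b * q) / det, (a * q - b * p) / det)

def g(weights, features):
    """Linear emotion score, clipped to [0, 1]."""
    s = sum(w * f for w, f in zip(weights, features))
    return max(0.0, min(1.0, s))

# Invented training templates:
# (warmth contrast, slanted lines) -> degree of action.
X = [(0.9, 0.8), (0.1, 0.2), (0.8, 0.1), (0.2, 0.9)]
y = [0.9, 0.1, 0.5, 0.6]
w = fit_weights(X, y)
score = g(w, (0.9, 0.8))  # a warm, slanted image scores high on action
```

In a production system one would use a standard regression routine instead of the hand-rolled solver; the point is only that the weights, and hence the emotion scores, adapt to the training templates.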
Content representation for commercial videos
Now we'll present the content representation rules for commercial videos.
Expressive level. Semiotic principles permit establishing a link between the set of perceptual features and each of the four semiotic categories of commercial videos (practical, playful, utopic, and critical). This supports organizing a database of commercials on the basis of their semiotic content. Once a video has been segmented into shots, all perceptual features are extracted. Each feature's value is represented according to a fuzzy representation model. A score in [0, 1] is introduced for each feature to qualify the feature's presence in the video. Inter-frame features address
❚ the presence of cuts (ϕcuts) and dissolves (ϕdissolves),

❚ the presence or absence of colors recurring in many frames (ϕrecurrent ~1 and ϕrecurrent ~0, respectively), and

❚ the presence or absence of editing effects (ϕedit ~1 and ϕedit ~0, respectively).

Intra-frame features address the presence of

❚ horizontal and vertical or slanted lines (ϕhor/vert ~1 and ϕhor/vert ~0, respectively) and

❚ saturated or unsaturated colors (ϕsaturated ~1 and ϕsaturated ~0, respectively).
Table 1 summarizes how perceptual features are mapped onto the four semiotic categories. We adopted a linear mapping in which table entries indicate expected feature weights and "×" indicates irrelevance.
We represent the video content through sentences qualifying the presence of practical, playful, utopic, and critical features. These sentences are expressed through formulas (Φ) defined as

Φ := Fpractical ≥ k1 | Fplayful ≥ k2 | Futopic ≥ k3 | Fcritical ≥ k4 | Φ1 ∧ Φ2 | Φ1 ∨ Φ2

where the ki's represent fixed threshold values.
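A sketch of this classification step follows. The agreement measure and equal feature weighting are our assumptions for illustration; the article fits actual weights:

```python
# Each semiotic category score F is a combination of fuzzy perceptual
# scores, following the expectations in Table 1 (~1 rewards presence,
# ~0 rewards absence, irrelevant features are omitted).

EXPECTED = {  # feature -> expected value per category (Table 1, sketched)
    "practical": {"saturated": 0.0, "hor_vert": 1.0, "dissolves": 1.0},
    "playful":   {"saturated": 1.0, "hor_vert": 0.0, "cuts": 1.0},
    "utopic":    {"recurrent": 1.0, "cuts": 1.0, "dissolves": 1.0},
    "critical":  {"recurrent": 0.0, "hor_vert": 1.0, "edit": 0.0},
}

def category_score(phi, expected):
    """Average agreement between observed scores and expected values."""
    terms = [1.0 - abs(phi[f] - v) for f, v in expected.items()]
    return sum(terms) / len(terms)

def classify(phi, threshold=0.7):
    """Keep the categories whose formula F >= k holds (fixed threshold k)."""
    scores = {c: category_score(phi, e) for c, e in EXPECTED.items()}
    return {c: s for c, s in scores.items() if s >= threshold}

# A hypothetical "practical" spot: unsaturated, frontal, long dissolves.
phi = {"saturated": 0.1, "recurrent": 0.5, "hor_vert": 0.9,
       "cuts": 0.2, "dissolves": 0.8, "edit": 0.6}
labels = classify(phi)
```

For this invented feature vector only the practical formula passes its threshold, which is the kind of sentence-level qualification the Φ formulas express.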
Emotional level. High-level properties of a commercial relate to the feelings it inspires in the observer. We've organized these characteristics in a hierarchical fashion. A first classification separates commercials that induce action from those that induce quietness. Each class then further splits to specify the kind of action and quietness. Explicitly, action in a commercial can induce two different feelings, suspense and excitement. Similarly, quietness can be further specified as relaxation and happiness.
In the following, we describe the mapping between emotional features and low-level perceptual features, reflecting reference principles of visual communication (see also Table 2).
❚ Action: The degree of action GVaction for a video can be improved by the presence of red and purple. Action videos are often characterized by framings exhibiting high slanting. Short sequences are joined by cuts. The main distinguishing feature of these videos lies in the presence of a high degree of motion in the scene.

❚ Excitement: Videos that communicate excitement (with degree GVexcite) typically feature short sequences joined through cuts.

❚ Suspense: To improve the degree of suspense GVsuspense in a video that communicates action, directors combine both long (ϕlong) and short (ϕshort) sequences in the video and join them through frequent cuts.
Table 1. Perceptual mapping onto semiotic categories.

Perceptual Features   Fpractical   Fplayful   Futopic   Fcritical
ϕsaturated            ~0           ~1         ×         ×
ϕrecurrent            ×            ×          ~1        ~0
ϕhor/vert             ~1           ~0         ×         ~1
ϕcuts                 ×            ~1         ~1        ×
ϕdissolves            ~1           ×          ~1        ×
ϕedit                 ×            ×          ×         ~0
❚ Quietness: The degree of quietness GVquiet for a video can be improved by the presence of blue, orange, green, and white colors, and lowered by the presence of black and purple. Quiet videos feature horizontal framings. A few long sequences might be present, possibly joined through dissolves.
❚ Relaxation: Videos that communicate relaxation (with degree GVrelax) don't show relevant motion components.
❚ Happiness: Videos that communicate happiness (with degree GVhappy) share the same features as quiet videos but also exhibit a relevant motion component.
Presently, we use a linear approximation to model emotional feature mapping, constructed by weight adaptation according to a linear regression scheme. Following the above scheme, the extent to which a video k conforms to one of the six classes is computed through six scores {GVj(k)}, j = 1, …, 6, each representing a weighted combination of low-level video features. To achieve good discrimination between the six classes of commercials requires that for a generic video belonging to category j, the ratio GVj/GVm must be high, at least for all categories m that are not a generalization of category j (we call these categories j-opponent categories).
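The weighted combination can be sketched as follows. The feature subset and weight values here are illustrative stand-ins chosen to follow the sign pattern of Table 2 (~1 as positive, ~0 as negative, × as zero); they are not the coefficients actually fitted by our regression scheme.

```python
import numpy as np

# Illustrative weights mapping low-level features to emotion scores.
# Real weights would come from linear regression on labeled videos.
FEATURES = ["cuts", "motion", "hor_vert", "red", "purple", "blue"]
WEIGHTS = {
    "action":    np.array([0.3, 0.4, -0.2, 0.3, 0.2, 0.0]),
    "quietness": np.array([0.1, -0.3, 0.3, 0.0, -0.2, 0.3]),
}

def emotion_scores(phi):
    """phi: dict of low-level features in [0, 1] -> dict of GV_j scores."""
    x = np.array([phi[f] for f in FEATURES])
    return {j: float(w @ x) for j, w in WEIGHTS.items()}

# A fast, red-dominated video with slanted framings.
phi = {"cuts": 0.9, "motion": 0.8, "hor_vert": 0.1, "red": 0.7,
       "purple": 0.5, "blue": 0.1}
scores = emotion_scores(phi)
```

With these toy weights the example video scores much higher on action than on quietness, which is the kind of separation the GVj/GVm ratio condition demands.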
Model validation and retrieval results

At the Visual Information Processing Lab of the University of Florence, we've used the compositional semantics framework expounded earlier to develop systems for retrieving still images and videos based on both expressive and emotional content.13,14 We discuss the method used in these systems for extracting perceptual features from raw visual data in the sidebar "Perceptual Features for Still Images and Videos."
In this section, we present examples of retrieving art images and commercial videos according to the expressive level. You can find retrieval based on emotional features elsewhere.15 We evaluated retrieval performance in terms of effectiveness, that is, a measure of the agreement between human evaluators and the system in ranking a test set of images according to their similarity to a query. Measuring effectiveness reliably requires a small image test set (typically 20 to 50 images). Given a sample query, we evaluated the agreement as the percentage of human evaluators who rank images in the same (or very close) position as the system does. We define effectiveness as
Sj(i) = Σk Qj(i, k),  with k ranging from Pj(i) − σj(i) to Pj(i) + σj(i)

where i is an image from the test set, j denotes the sample query, Pj(i) is the rank assigned to image i by the system, and σj(i) is the width of the window centered in that rank. Qj(i, k) is the percentage of people who ranked the i-th image in position k, so Sj(i) sums the evaluators who placed the image between Pj(i) − σj(i) and Pj(i) + σj(i).
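This effectiveness measure can be sketched directly; the evaluator ranks below are invented for illustration.

```python
import numpy as np

def effectiveness(ranks, system_rank, sigma):
    """S_j(i): fraction of evaluators who placed image i within
    +/- sigma of the system's rank P_j(i).

    ranks: rank assigned to image i by each human evaluator.
    """
    ranks = np.asarray(ranks)
    lo, hi = system_rank - sigma, system_rank + sigma
    return float(np.mean((ranks >= lo) & (ranks <= hi)))

# 35 evaluators; the system ranked the image 3rd; window width 1.
ranks = [3] * 20 + [2] * 8 + [4] * 4 + [7] * 3
s = effectiveness(ranks, system_rank=3, sigma=1)   # 32 of 35 fall in [2, 4]
```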
Table 2. Perceptual mapping onto emotional features.

                      Emotion Categories
Perceptual Features   Action   Excitement   Suspense   Quietness   Relaxation   Happiness
ϕdissolves            ×        ×            ×          ~1          ~1           ~1
ϕcuts                 ~1       ~1           ~1         ~1          ~1           ~1
ϕlong                 ×        ×            ~1         ~1          ~1           ~1
ϕshort                ×        ~1           ~1         ~1          ~1           ~1
ϕmotion               ~1       ~1           ~1         ×           ~0           ~1
ϕhor/vert             ~0       ~0           ~0         ~1          ~1           ~1
ϕred                  ~1       ~1           ~1         ×           ×            ×
ϕorange               ×        ×            ×          ~1          ~1           ~1
ϕgreen                ×        ×            ×          ~1          ~1           ~1
ϕblue                 ×        ×            ×          ~1          ~1           ~1
ϕpurple               ~1       ~1           ~1         ~0          ~0           ~0
ϕwhite                ×        ×            ×          ~1          ~1           ~1
ϕblack                ×        ×            ×          ~0          ~0           ~0

"×" indicates irrelevance.
Perceptual Features for Still Images and Videos

Here we discuss the perceptual features for still images and videos.
Still images

An image's semantics relates to its color content and the presence of elements such as lines that induce dynamism and action.
Colors

Color cluster analysis helps segment images into color regions. We can obtain clustering in 3D color space by using an improved version of the standard K-means algorithm, which avoids converging to nonoptimal solutions. This algorithm uses competitive learning as the basic technique for grouping points in the color space. An image's chromatic content can then be expressed using a set of eight numbers normalized in [0, 1], one for each color in the set {red, orange, yellow, green, blue, purple, white, black}. The i-th number quantifies the presence in the image of a region exhibiting the i-th color.
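The resulting eight-number chromatic vector can be sketched as follows. For brevity this replaces the competitive-learning K-means step with fixed color prototypes; the prototype RGB values are illustrative assumptions.

```python
import numpy as np

# Illustrative RGB prototypes for the eight color names.
PROTOTYPES = {
    "red": (255, 0, 0), "orange": (255, 165, 0), "yellow": (255, 255, 0),
    "green": (0, 128, 0), "blue": (0, 0, 255), "purple": (128, 0, 128),
    "white": (255, 255, 255), "black": (0, 0, 0),
}

def chromatic_content(pixels):
    """Map each pixel to its nearest prototype and return the eight
    normalized presence values in [0, 1] (fractions of the image)."""
    pts = np.asarray(pixels, dtype=float).reshape(-1, 3)
    protos = np.asarray(list(PROTOTYPES.values()), dtype=float)
    # Squared Euclidean distance of every pixel to every prototype.
    d = ((pts[:, None, :] - protos[None, :, :]) ** 2).sum(axis=2)
    counts = np.bincount(d.argmin(axis=1), minlength=len(protos))
    return dict(zip(PROTOTYPES, counts / len(pts)))

img = [(250, 5, 5)] * 3 + [(5, 5, 250)]   # 3 reddish pixels, 1 bluish pixel
vec = chromatic_content(img)              # vec["red"] = 0.75, vec["blue"] = 0.25
```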
Lines

Detecting significant line slopes in an image can be accomplished by using the Hough transform to generate a line slope histogram (see Figure B). The feature ϕhor/vert ∈ [0, 1] gives the ratio of horizontal and vertical lines with respect to the overall number of lines in the image.
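Assuming line slopes have already been extracted (for example, from the peaks of the Hough slope histogram), ϕhor/vert can be sketched as below; the angular tolerance is an assumption.

```python
def phi_hor_vert(slopes_deg, tol=10.0):
    """Fraction of detected lines that are horizontal or vertical.

    slopes_deg: line slopes in degrees in [0, 180], e.g. taken from
    a Hough-transform line slope histogram.  tol is the angular
    tolerance for calling a line horizontal or vertical.
    """
    if not slopes_deg:
        return 0.0
    def is_hv(a):
        # Distance to the nearest of 0, 90, or 180 degrees.
        return min(a, abs(a - 90), abs(a - 180)) <= tol
    return sum(is_hv(a) for a in slopes_deg) / len(slopes_deg)

# Three near-horizontal/vertical lines out of four detected lines.
phi = phi_hor_vert([2.0, 88.0, 178.0, 45.0])   # 0.75
```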
Videos

Video analysis primarily aims to segment video, that is, to identify each shot's start and end points and to characterize the video through its most representative keyframes. Once a video has been fragmented into shots and video editing features have been extracted, each shot's content can be internally described by segmenting each shot's keyframe as described in the previous section of this sidebar.
Cuts

Rapid motion in the scene and sudden changes in lighting yield low correlation between contiguous frames, especially when a high temporal subsampling rate is adopted. To avoid false cut detection, we studied a metric insensitive to such variations while reliably detecting true cuts. We partition each frame into nine subframes and represent each subframe by its color histograms in the hue, saturation, intensity (HSI) color space. We detect cuts by considering the volume of the difference of subframe histograms in two consecutive frames. The presence of a cut at frame i can be detected by thresholding the average volume value over the nine subframes. After repeating the above procedure for all frames i = 1 … #frames, we simply obtain the overall feature related to the presence of cuts in a video as ϕcuts = #cuts/#frames, where ϕcuts ∈ [0, 1].
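A minimal sketch of the subframe histogram test; the bin count and threshold values are illustrative, not the tuned values used in our system.

```python
import numpy as np

def subframe_hists(frame, bins=8):
    """Split an HxWx3 frame (channels in [0, 1], e.g. HSI) into a 3x3
    grid and return one histogram per channel per subframe."""
    h, w, _ = frame.shape
    hists = []
    for i in range(3):
        for j in range(3):
            sub = frame[i * h // 3:(i + 1) * h // 3,
                        j * w // 3:(j + 1) * w // 3]
            hists.append([np.histogram(sub[..., c], bins=bins,
                                       range=(0.0, 1.0))[0] for c in range(3)])
    return np.asarray(hists, dtype=float)   # shape (9, 3, bins)

def is_cut(prev, curr, thresh):
    """Flag a cut when the average absolute histogram difference over
    the nine subframes exceeds the threshold."""
    diff = np.abs(subframe_hists(prev) - subframe_hists(curr))
    return bool(diff.sum(axis=(1, 2)).mean() > thresh)

dark = np.zeros((9, 9, 3))
light = np.full((9, 9, 3), 0.9)
cut = is_cut(dark, light, 10.0)   # abrupt content change -> cut
```

ϕcuts then follows as the count of flagged frames divided by #frames.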
Dissolves

The dissolve effect merges two sequences by partly overlapping them. Detecting dissolves in commercials proves particularly difficult because dissolves typically occur in a limited number of consecutive frames. Due to this peculiarity, existing approaches to detect dissolves (developed for movies) have shown poor performance. We instead use corner statistics to detect dissolves. During the editing effect, corners associated with the first sequence gradually disappear and those associated with the second sequence gradually appear.
Figure B. Perceptual features for still images. From left to right: video frame, computed edges, and line slope histogram.
Image retrieval

For art image retrieval, we used a test set of 40 images ranging from Renaissance to contemporary painters. We measured retrieval effectiveness with reference to four different queries, addressing contrasts of luminance, warmth, and saturation, and harmonic accordance.
We collected answers from 35 experts in the fine arts field. We asked them to rank database images according to each reference query. Figure 3 shows the test database images. Figure 4 shows the eight top-ranked images the system retrieved in response to each of the four reference queries. (For more information about these images, visit http://www.cineca.it/wm/.)
Figure 5 shows the plots of effectiveness S as a function of rankings. They show the agreement between the people interviewed and the system in ranking images according to the queries. Figure 5 shows only rankings from 1 to 8, since they represent agreement on the most representative images. The plots show a very large agreement
This yields a local minimum in the number of corners detected during the dissolve (see Figure C). An image corner is characterized by large and distinct values of the eigenvalues of the gradient autocorrelation matrix. We evaluate the feature ϕdissolves ∈ [0, 1] as ϕdissolves = #dissolves/#frames.
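The corner-count minimum can be sketched as follows, assuming per-frame corner counts are already available from a corner detector; the drop ratio and neighborhood width are illustrative knobs.

```python
import numpy as np

def dissolve_minima(corner_counts, drop=0.5, width=3):
    """Return frame indices where the corner count dips to a local
    minimum well below its neighborhood average, the signature of a
    dissolve: corners of the outgoing shot vanish before corners of
    the incoming shot appear."""
    c = np.asarray(corner_counts, dtype=float)
    found = []
    for i in range(width, len(c) - width):
        hood = np.r_[c[i - width:i], c[i + 1:i + 1 + width]]
        is_local_min = c[i] == c[i - width:i + width + 1].min()
        if is_local_min and c[i] < drop * hood.mean():
            found.append(i)
    return found

counts = [40, 42, 41, 12, 39, 43, 40, 41]        # sharp dip at frame 3
idx = dissolve_minima(counts)                    # [3]
phi_dissolves = len(idx) / len(counts)           # 1/8 = 0.125
```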
Motion

We analyze motion by tracking corners in a sequence. For each shot, we compute a feature ϕmotion ∈ [0, 1] that represents the average intensity of motion. ϕmotion = 0 means that motion is absent during the sequence. Higher values of ϕmotion indicate the presence of increasingly relevant motion components.
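Assuming corner trajectories are available from the tracker, ϕmotion can be sketched as a normalized average displacement; the normalization constant is an assumption, not part of the original formulation.

```python
import numpy as np

def motion_intensity(tracks, max_disp):
    """Average per-frame displacement of tracked corners, normalized
    into [0, 1] by max_disp, the largest displacement we care to
    distinguish.

    tracks: array of shape (n_corners, n_frames, 2) with (x, y)
    corner positions across the shot.
    """
    t = np.asarray(tracks, dtype=float)
    # Per-corner, per-step displacement magnitudes.
    disp = np.linalg.norm(np.diff(t, axis=1), axis=2)
    return float(np.clip(disp.mean() / max_disp, 0.0, 1.0))

# Two corners over three frames: one static, one moving 5 pixels/frame.
tracks = [[(0, 0), (0, 0), (0, 0)],
          [(0, 0), (3, 4), (6, 8)]]
phi_motion = motion_intensity(tracks, max_disp=10.0)   # (0+0+5+5)/4 / 10 = 0.25
```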
Inter-shot features

We represent color-related inter-shot features used in a video as ϕrecurrent ∈ [0, 1] (expressing the ratio between colors that recur in a high percentage of keyframes and the overall number of significant colors in the video sequence) and ϕsaturated (expressing the relative presence of saturated colors in the scene). Another relevant inter-shot characteristic, the rhythm of a sequence of shots, relates to shot duration and to the use of cuts and dissolves to join shots. We define the rhythm r(i1, i2) ∈ [0, 1] of a video sequence over a frame interval [i1, i2] as
r(i1, i2) = (#cuts + #dissolves) / (i2 − i1 + 1)

where #cuts and #dissolves are measured in the same interval. A simple feature measuring an entire sequence's internal rhythm is the average rhythm, which relates to the overall number of breaks: ϕedit = r(1, #frames).
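The rhythm feature is straightforward to compute once cuts and dissolves have been counted:

```python
def rhythm(n_cuts, n_dissolves, i1, i2):
    """r(i1, i2) = (#cuts + #dissolves) / (i2 - i1 + 1): the density
    of editing breaks in the frame interval [i1, i2]."""
    return (n_cuts + n_dissolves) / (i2 - i1 + 1)

# Average rhythm of a whole 200-frame video with 7 cuts and 3 dissolves.
phi_edit = rhythm(7, 3, 1, 200)   # 10/200 = 0.05
```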
Figure C. Corners detected during a dissolve effect.
Figure 3. Images used to test system effectiveness for the representation of expressive content. (Courtesy of WebMuseum.)
between the experts and the system in assigning similarity rankings.
Figure 4. Top-ranked images according to queries for contrast of luminance (a), contrast of saturation (b), contrast of warmth (c), and harmonic accordance (d).

Figure 5. System effectiveness measured against experts for four different sample queries: contrast of luminance (a), contrast of saturation (b), contrast of warmth (c), and harmonic accordance (d).

Figure 6 shows an example of retrieval according to the expressive level using a database of 477 images representing paintings from the 15th to the 20th century. Two dialog boxes define properties (hue and dimension) of the two sketched regions of Figure 6. The user may define the degree of truth (60 percent) assigned to the query. Figure 6 (right) shows the retrieved paintings. The 12 best-matched images all feature large regions with contrasting luminance. The top-ranked image represents an outstanding example of luminance contrast, featuring a black region over a white background. Images in the second, third, fifth, sixth, and seventh positions exemplify how contrast of luminance between large regions can be used to convey the perception of different planes of depth.

Figure 6. Results of a query for images with two large regions showing contrasting luminance. (Courtesy of WebMuseum.)
Video retrieval

We evaluated video retrieval effectiveness using a test set of 20 commercial videos. A team of five experts in the semiotic and marketing fields classified each video in terms of the four semiotic categories. We used the common representation of commercials categories based on the semiotic square. This instrument, first introduced by Greimas,10 combines pairs of four semiotic objects with the same semantic level, according to three basic relationships: opposition, completion, and contradiction. Objects placed at opposite sides of the square are complementary. Figure 7a shows the practical-playful and critical-utopic diagonals of the semiotic square as coordinate axes. We asked the experts to classify the commercials by associating each commercial with a position on the square. To ease the classification task, 25 rectangular regions were identified in the square through a regular partition, which supports the definition of a video's three different degrees of conformity to a generic category. For instance, all videos classified in region 4 of Figure 7a feature a high degree of conformity to the utopic category and a medium conformity to the playful one.
Figure 7 shows how the experts (Figure 7b) and the system (Figure 7c) classified database commercials. Each region in the square has a vertical cylinder whose height is proportional to the percentage of commercials located in that region (cylinders of zero height aren't shown).
Figure 8 shows a more accurate measure of system effectiveness by displaying the agreement between the system and experts with reference to queries for purely practical, critical, utopic, and playful commercials. Each query considered the five top-ranked commercials. For a generic commercial, we measured the agreement between the system and experts as A = 1 − d/dmax, where d is the "city block" distance between the two blocks in the semiotic square where the system and the experts located the commercial. dmax is the maximum value of d, here 8. The best average agreement corresponds to the query for playful commercials, evidencing the effectiveness of the features used to model this category. The worst performance corresponds to queries for practical commercials. In this case, the system classified many commercials as practical, while the experts rated them as critical. Such a classification mismatch originates from the recurrent presence in critical commercials of foreground views of the promoted product. The experts can easily detect and recognize a foreground view, but the system can't detect this feature.
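The agreement measure can be sketched directly, with grid cells of the 5×5 partition given as (row, column) pairs:

```python
def agreement(system_pos, expert_pos, d_max=8):
    """A = 1 - d/d_max, with d the city-block distance between the
    5x5-grid cells (row, col) where the system and the experts
    located a commercial.  d_max = 8 is the distance between
    opposite corners of the 5x5 partition."""
    d = abs(system_pos[0] - expert_pos[0]) + abs(system_pos[1] - expert_pos[1])
    return 1.0 - d / d_max

a_same = agreement((1, 2), (1, 2))   # 1.0: system and experts agree exactly
a_far = agreement((0, 0), (4, 4))    # 0.0: opposite corners, d = 8
```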
Figure 7. Representation of database commercials' semiotic features (a) along with the experts' (b) and system's (c) classification of them.

Figure 8. Plots of the agreement between system and experts' classification of commercials with reference to queries for purely practical (a), critical (b), utopic (c), and playful (d) commercials.

Figures 9 through 12 show examples of video retrieval according to the expressive level using a database of 131 commercials. Users can specify queries by defining the degree by which the retrieved commercials have to conform to the four semiotic categories. Figure 9 shows the output of the retrieval system in response to a query for purely playful commercials. At the right of the list of retrieved items, a bar graph displays the degree by which each shot of the top-ranked video belongs to the playful category (the white, thin vertical lines represent cuts and dissolves). The top-ranked videos in this category all advertise products for "young and smart" people (sportswear, sport watches, blue jeans, and so on). These results aren't surprising, since they reflect the common marketing practice of targeting a commercial to a specific audience.
Figures 10 and 11 report some of the keyframes for the two top-ranked spots in Figure 9. The first best-ranked spot (advertising sportswear) presents all the typical features of a playful commercial, including a very fast rhythm, unorthodox camera takes, situations (like skating on a tennis court) reflecting semantic issues at a higher level than that of our computer analysis, and very saturated colors. Similar features appear in the second-ranked commercial. In Figure 11 notice the presence of quasi-identical keyframes (the close-ups of the man) typical of a nonlinear story. In this spot the camera frenetically switches between close-up views of the man and his dogs, almost never alternating details and global views like a utopic spot would do.
Figure 12 shows the output of the retrieval system in response to a query for purely critical commercials. In contrast to the previous example, the top-ranked videos obtained in response to the query all advertise typical home products. Figures 13 and 14 show some relevant frames for the two top-ranked commercials in Figure 12.

Figure 9. Retrieval of playful commercials.

Figure 10. Some of the keyframes for the first-ranked spot in Figure 9 ("Sergio Tacchini").

Figure 11. Some of the keyframes for the second-ranked spot in Figure 9 ("Audi A3").

Figure 12. Retrieval of critical commercials.

Courtesy of RAI.
Courtesy of Mediaset.
Conclusions and future work

If, as the adage says, "an image is worth a thousand words," then every designer of multimedia retrieval systems is well aware that the converse also holds true. That is, many times a word can stand for a thousand images, since it represents an equivalence class of objects, thus reflecting a higher semantic level than that of the objects themselves. We believe that future multimedia retrieval systems will have to support access to information at different semantic levels to reflect diverse application needs and user queries.
The research presented in this article attempts to introduce different levels of signification by a layered representation of visual knowledge. The most relevant insight we gained is that defining rules that capture visual meaning can be difficult, especially when dealing with general visual domains. As a good design practice, such rules should derive from specific domain characterizations, and possibly be refined and tailored to specific classes of users.
Future work will address experimenting with different application scenarios (such as TV news and movies), and complementing visual features and language sentences with textual and audio data. MM
Acknowledgments

We thank Bruno Bertelli and Laura Lombardi for useful discussions during the development of this work. We'd also like to acknowledge all those experts who provided their support in the performance tests.
References

1. T. Joseph and A. Cardenas, "PicQuery: A High-Level Query Language for Pictorial Database Management," IEEE Trans. on Software Engineering, Vol. 14, No. 5, May 1988, pp. 630-638.

2. N. Roussopoulos, C. Faloutsos, and T. Sellis, "An Efficient Pictorial Database System for Pictorial Structured Query Language (PSQL)," IEEE Trans. on Software Engineering, Vol. 14, No. 5, May 1988, pp. 639-650.

3. M. Flickner et al., "Query by Image and Video Content: The QBIC System," Computer, Vol. 28, No. 9, Sept. 1995, pp. 310-315.

4. J.R. Smith and S.F. Chang, "VisualSEEk: A Fully Automated Content-Based Image Query System," Proc. ACM Multimedia 96, ACM Press, New York, Nov. 1996.

5. D.A. White and R. Jain, "Similarity Indexing with the SS-tree," Proc. 12th Int'l Conf. on Data Engineering, IEEE Computer Society Press, Los Alamitos, Calif., Feb. 1996, pp. 516-523.

6. R.J. Brachman and H.J. Levesque, eds., Readings in Knowledge Representation, Morgan Kaufmann, Los Altos, Calif., 1985.

7. R. Arnheim, Art and Visual Perception: A Psychology of the Creative Eye, Regents of the University of California, Palo Alto, Calif., 1954.

8. J. Itten, Art of Color (Kunst der Farbe), Otto Maier Verlag, Ravensburg, Germany, 1961 (in German).

9. C.R. Haas, Advertising Practice (Pratique de la Publicité), Bordas, Paris, 1988 (in French).
Figure 13. Some of the keyframes for the first-ranked spot in Figure 12 ("Pasta Barilla"). (Courtesy of RAI.)

Figure 14. Some of the keyframes for the second-ranked spot in Figure 12 ("Cera Emulsio"). (Courtesy of Mediaset.)
10. A.J. Greimas, Structural Semantics (Sémantique Structurale), Larousse, Paris, 1966 (in French).

11. J.-M. Floch, Semiotics, Marketing, and Communication: Below the Signs, the Strategies (Sémiotique, Marketing et Communication: Sous les signes, les stratégies), University of France Press, Paris, 1990 (in French).

12. R.C. Carter and E.C. Carter, "CIELUV Color Difference Equations for Self-Luminous Displays," Color Research and Applications, Vol. 8, No. 4, 1983, pp. 252-553.

13. J.M. Corridoni, A. Del Bimbo, and P. Pala, "Sensations and Psychological Effects in Color Image Database," ACM Multimedia Systems, Vol. 7, No. 3, May 1999, pp. 175-183.

14. M. Caliani et al., "Commercial Video Retrieval by Induced Semantics," Proc. IEEE Int'l Workshop on Content-Based Access of Images and Video Databases, IEEE CS Press, Los Alamitos, Calif., Jan. 1998, pp. 72-80.

15. C. Colombo, A. Del Bimbo, and P. Pala, "Retrieval of Commercials by Video Semantics," Proc. IEEE Int'l Conf. on Computer Vision and Pattern Recognition (CVPR 98), IEEE CS Press, Los Alamitos, Calif., June 1998, pp. 572-577.
Carlo Colombo is an assistant professor in the Department of Systems and Informatics at the University of Florence, Italy. His main research activities are in the field of computer vision, with specific interests in image and video analysis, human-machine interfaces, robotics, and multimedia. He holds an MS in electronic engineering from the University of Florence, Italy (1992) and a PhD in robotics from the Sant'Anna School of University Studies and Doctoral Research, Pisa, Italy (1996). He is a member of IEEE and the International Association for Pattern Recognition (IAPR), and presently serves as secretary to the IAPR Italian Chapter.
Alberto Del Bimbo is a professor in the Department of Systems and Informatics at the University of Florence, Italy. His scientific interests cover image sequence analysis, shape and object recognition, image databases and multimedia, visual languages, and advanced man-machine interaction. He earned his MS in electronic engineering at the University of Florence, Italy in 1977. He is presently an associate editor of Pattern Recognition, Journal of Visual Languages and Computing, and IEEE Transactions on Multimedia. He is a member of IEEE and IAPR. He presently serves as the Chairman of the Italian Chapter of IAPR. He is the general chair of the IEEE Int'l Conference on Multimedia Computing and Systems (ICMCS 99).
Pietro Pala received his MS in electronic engineering at the University of Florence, Italy, in 1994. In 1998, he received his PhD in information science from the same university. Currently he is a research scientist at the Department of Systems and Informatics at the University of Florence. His current research interests include recognition, image databases, neural networks, and related applications.
Contact the authors at the Department of Systems and Informatics, University of Florence, Via Santa Marta 3, I-50139, Florence, Italy, e-mail {columbus,delbimbo,pala}@dsi.unifi.it.