A compositional approach increases the level of representation that can be automatically extracted and used in a visual information retrieval system. Visual information at the perceptual level is aggregated according to a set of rules. These rules reflect the specific context and transform perceptual words into phrases capturing pictorial content at a higher, and closer to the human, semantic level.
Visual information retrieval systems have entered a new era. First-generation systems allowed access to images and videos through textual data.1,2 Typical searches for these systems include, for example, "all images of paintings of the Florentine school of the 15th century" or "all images by Cezanne with landscapes." Such systems expressed information through alphanumeric keywords or scripts. They employed representation schemes like relational models, frame models, and object-oriented models. Current-generation retrieval systems, on the other hand, support full retrieval by visual content.3,4 Access to visual information is performed not only at a conceptual level, using keywords as in the textual domain, but also at a perceptual level, using objective measurements of visual content. In these systems, image processing, pattern recognition, and computer vision constitute an integral part of the system's architecture and operation. They objectively analyze pixel distribution and extract the content descriptors automatically from raw sensory data. Image content descriptors are commonly represented as feature vectors, whose elements correspond to significant parameters that model image attributes. Therefore, visual attributes are regarded as points in a multidimensional feature space, where point closeness reflects feature similarity.
These advances (for comprehensive reviews of the field, see the "Further Reading" sidebar) have paved the way for third-generation systems, featuring full multimedia data management and networking support. Forthcoming standards such as MPEG-4 and MPEG-7 (see the Nack and Lindsay article in this issue) provide the framework for efficient representation, processing, and retrieval of visual information.
Yet many problems must still be addressed and solved before these technologies can emerge. An important issue is the design of indexing structures for efficient retrieval from large, possibly distributed, multimedia data repositories. To achieve this goal, image and video content descriptors can be internally organized and accessed through multidimensional index structures.5 A second key problem is bridging the semantic gap between the system and users, that is, devising representations that capture visual content at the high semantic levels especially relevant for retrieval tasks. Specifically, automatically obtaining a representation of high-level visual content remains an open issue. Virtually all the systems for automatic storage and retrieval of visual information proposed so far use low-level perceptual representations of pictorial data, which have limited semantics.
Building up a representation proves tantamount to defining a model of the world, possibly through a formal description language, whose semantics capture only a few significant aspects of the information content.6
Different languages and semantics induce diverse world representations. In text, for example, the meaning of single words is specific yet limited, and an aggregate of several words, a phrase, produces a higher degree of significance and expressivity. Hence, the rules for the syntactic composition of signs in a given language also generate a new world representation, offering richer semantics in the hierarchy of signification.
To avoid equivocation, a retrieval system should embed a semantic level reflecting as much as possible the one humans refer to during interrogation. The most common way to enrich a visual information retrieval system's semantics is to annotate pictorial information manually at storage time through a set of external keywords describing the pictorial content. Unfortunately, textual annotation has several problems:
1. It's too expensive to go through manual annotation with large databases.
2. Annotation is subjective (generally, the annotator and the user are different persons).

3. Keywords typically don't support retrieval by similarity.

Semantics in Visual Information Retrieval
Carlo Colombo, Alberto Del Bimbo, and Pietro Pala
University of Florence, Italy

Feature Article
1070-986X/99/$10.00 © 1999 IEEE
Automatically increasing the semantic level of representation provides an alternative. Starting from perceptual features, the atomic elements of visual information, some intermediate semantic levels can be extracted using a suitable set of rules. Perceptual features represent the evidence upon which to build the interpretation of visual data. A process of syntactic construction called compositional semantics builds the semantic representation.
In this article, we discuss how to extract automatically, from raw image and video data, two distinct semantic levels and how to represent these levels through appropriate language rules. As we organize semantic levels according to a signification hierarchy, the corresponding description languages become stratified, allowing the composition of higher semantic levels according to syntactic rules that combine perceptual features and lower level signs. Since these languages directly depend on objective features, the approach naturally accommodates visual search by example and retrieval by similarity. We'll address two different visual contexts: art paintings and commercial videos. We'll also present retrieval examples showing that compositional semantics improves accordance with human judgment and expectation.
A language-oriented approach
Here we discuss the compositional semantics framework we developed. We also provide background theories in art and advertising.
Compositional semantics framework
The compositional semantics framework involves a bottom-up analysis and processing of a visual message, starting from its perceptual features. For still images, these features are image colors and edges. For videos or image sequences, additional features include the presence of editing effects, the motion of objects within a scene, and so on. Without loss of generality, the perceptual properties of a visual message can be represented through a set of scores P = {ϕi}, i = 1, …, n, each score ϕi ∈ [0, 1] representing the extent to which the i-th feature appears in the message.
We devised two distinct levels of the signification hierarchy, namely the expressive and the emotional levels, as plausible intermediate steps involved in the construction of meaning.
Semantic levels: expression and emotion. At the expressive level, perceptual data are organized into a group of new features, the expressive features, which take into account both spatial and temporal distributions of perceptual features. Expressive features reflect concepts that humans embody at a higher level of abstraction to achieve a more compact visual representation.
Combination rules are modeled as functions F acting over the perceptual feature set P and returning a score expressing the degree of truth by which the rule F holds. Hence, a rule Fj can be defined as
Fj : [0, 1]^n → [0, 1]
Operators of logical composition between rules extend the signification of the representation. We define these operators as
F1 ∧ F2 = min(F1, F2)
F1 ∨ F2 = max(F1, F2)
The expressive feature set F = {F1, …, Fm} qualifies the content of the visual message at the expressive level.
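These definitions can be sketched in a few lines of code. The snippet below is our illustration, not the authors' implementation; the rule names and feature scores are invented for the example:

```python
# Fuzzy composition operators from the text: AND is min, OR is max.
def f_and(a, b):
    """F1 ∧ F2 = min(F1, F2)."""
    return min(a, b)

def f_or(a, b):
    """F1 ∨ F2 = max(F1, F2)."""
    return max(a, b)

# Perceptual feature scores P = {phi_i}, each in [0, 1] (made-up values).
P = {"warm_colors": 0.8, "slanted_lines": 0.6, "saturated": 0.3}

# A rule F_j : [0, 1]^n -> [0, 1]; both rules below are hypothetical.
def F_dynamism(p):
    """High when the image is both warm and full of slanted lines."""
    return f_and(p["warm_colors"], p["slanted_lines"])

def F_softness(p):
    """High when colors are unsaturated or lines are not slanted."""
    return f_or(1.0 - p["saturated"], 1.0 - p["slanted_lines"])

# Rules compose into higher-level expressive features.
score = f_and(F_dynamism(P), F_softness(P))
```

Because every rule returns a value in [0, 1], composed rules stay in [0, 1] and can themselves be composed further, which is what lets the representation stratify into higher semantic levels.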
A musical example lets us capture the distinction between the expressive and emotional levels. Assume that the laws of harmony and counterpoint characterize music composition at the expressive level. Following these rules, however, doesn't guarantee a pleasant result for the audience. In other words, musical fruition and understanding involve adopting aesthetic criteria that go beyond expression (the syntax of meaning) and reach emotion as the ultimate semantics of meaning. (The art of J.S. Bach provides a remarkable example of how to reach musical beauty by a skillful adherence to the formal rules of 18th-century music.)

July–September 1999

Further Reading
For a comprehensive introduction to visual information retrieval, see

A. Del Bimbo, Visual Information Retrieval, Academic Press, London, 1999.

P. Aigrain, H. Zhang, and D. Petkovic, "Content-Based Representation and Retrieval of Visual Media: A State-of-the-Art Review," Multimedia Tools and Applications, Vol. 3, No. 4, Dec. 1996, pp. 179-202.

A. Gupta and R. Jain, "Visual Information Retrieval," Comm. of the ACM, Vol. 40, No. 5, May 1997, pp. 71-79.

A review of the state of the art in visual information processing can be found in

B. Furht, S.W. Smoliar, and H.J. Zhang, Video and Image Processing in Multimedia Systems, Kluwer Academic Publishers, Boston, 1996.

We formally construct the emotional level, at the top of our signification hierarchy, from the levels below, namely the expressive and perceptual levels. With a notation similar to the one above, rules at the emotional level are represented through functions G acting over the set of perceptual and expressive features P × F and returning a score that expresses the degree of truth by which the rule G holds. Hence, a rule Gk can be defined as
Gk : [0, 1]^(n+m) → [0, 1]
Operators of logical composition between rules can extend the representation's semantics. The set G = {Gk} qualifies the content of the visual message at the emotional level.
Perceptual, expressive, and emotional features qualify the meaning of a visual message at different levels of signification. For all these levels, construction rules depend on the specific data domain to which they refer (such as movies, commercials, and TV news for videos; paintings, photographs, and trademarks for still images). Specifically, the expressive level features objective properties that generally depend on collective cultural backgrounds. The emotional level, on the contrary, relies on subjective elements such as individual cultural background and psychological moods.
Background theories
Here we present theories that provide a reference framework for developing expressive and emotional rules in the domains of art images and commercial videos.

Expression in art images. Among the many authors who recently addressed the psychology of art images, Arnheim discussed the relationships between artistic form and perceptive processes,7
and Itten8 formulated a theory about the use of color in art and about the semantics it induces. Itten observed that color combinations induce effects such as harmony, disharmony, calmness, and excitement that artists consciously exploit in their paintings. Most of these effects relate to high-level chromatic patterns rather than to physical properties of single points of color (see, for example, Figure 1a). Such rules describe art paintings at the expressive level. The theory characterizes colors according to the categories of hue, luminance, and saturation. Twelve fundamental hues are chosen, and each of them is varied through five levels of luminance and three levels of saturation. These colors are arranged in a chromatic sphere called the Itten sphere (Figure 1b), such that perceptually contrasting colors have opposite coordinates with respect to the center of the sphere.

IEEE MultiMedia

Figure 1. (a) Contrasts and colors in an art image. (b) The Itten sphere, shown in external views, equatorial and longitudinal sections, and geographic coordinates (θ, φ, ρ); the pure color circle holds the twelve fundamental hues (yellow, orange-yellow, orange, orange-red, red, purple-red, purple, purple-blue, blue, green-blue, green, green-yellow). (c) The polygons generating three- and four-chromatic accordance.
Analyzing the polar reference system, we can identify four different types of color contrast: pure colors, light-dark, warm-cold, and saturated-unsaturated (quality). Psychological studies have suggested that, in Western culture, red-orange environments induce a sense of warmth (yellow through red-purple are warm colors). Conversely, green-blue conveys a sensation of cold (yellow-green through purple are cold colors). Cold sensations can be emphasized by contrast with a warm color or dampened by coupling with a highly cold tint. The concept of harmonic accordance combines hues and tones to generate a stability effect on the human eye. Harmony can be achieved by creating color accordances, generated on the sphere by connecting locations through regular polygons (see Figure 1c).
Emotion in art images. Color combinations also induce emotional effects. The mapping of low-level color primitives into emotions is quite complex. It must consider theories about the use of colors and cognitive models, and involve cultural and anthropological backgrounds. In other words, people with different cultures will likely share the same expressive-level representation but not perceive the same emotions in front of the same color pattern.
Artists use color combinations unconsciously and consciously to produce optical and psychological sensations.9 Warm colors attract the eye and grab the observer's attention more than cold colors. Cold sensations provided by a large region of cold color can be emphasized by contrast with a warm color or dampened by coupling with a highly cold tint. Similarly, small, close cold regions emphasize large, warm regions. Red communicates happiness, dynamism, and power. Orange, the warmest color, resembles the color of fire and thus communicates glory. Green communicates calmness and relaxation, and is the color of hope. Blue, a cold color, improves the dynamism of warm colors, suggesting gentleness, fairness, faithfulness, and virtue. Purple, a melancholy color, sometimes communicates fear. Brown generally serves as the background color for relaxing scenes. Differences in lightness determine a sense of plasticity and the perception of different planes of depth. The absence of contrasting hues and the presence of a single dominant color region inspire a sense of uneasiness, strengthened by the presence of dark yellow and purple colors. The presence of regions in harmonic accordance communicates a sense of stability and joy. Calmness and quiet can be conveyed by combining complementary colors. In the presence of two noncomplementary colors, the human eye looks for the complementary of the observed color, which rouses a sense of anguish.
Another basic contribution to the semantics of an image relates to the presence and characteristics of notable lines. In fact, line slope is a key feature through which the artist communicates different emotions. For example, an oblique slope communicates dynamism and action, while a flat slope, such as a horizon, communicates calmness and relaxation. Figure 2 shows an example, illustrating how saturated colors can combine with sloped lines to communicate a sensation of dynamism in the observer.
Figure 2. Some frames taken from a "Sergio Tacchini" commercial video. The use of colors and lines in each frame communicates dynamism. In this video, the effect is further enhanced by a high frequency of shots. (Courtesy of Mediaset.)

Expression in commercials. Semiotics studies the meaning of symbols and signs. In semiotics, a sign represents anything that conveys meaning according to commonly accepted conventions. Semiotics therefore suggests that signs relate to their meaning according to the specific cultural background. Semioticians usually identify two distinct steps for the production of meaning:10
1. an abstract level formed by narrative structures, that is, structures including all those basic signs that create meaning and those values determined by sign combinations, and

2. a concrete level formed by discourse structures, describing the way in which the author uses narrative elements to create a story.
Representing the expressive and emotional levels for commercial videos inherits some of the features introduced before for still images and adds new features related to the way frames are concatenated. Semiotics classifies advertising images, and specifically commercials, into four different categories (noted below) that relate to the narrative element.11 Directors will use the main narrative signs of the video (camera breaks, colors, editing effects, rhythm, shot angles, and lines) in a peculiar way, depending on the specific video typology considered.
❚ Practical commercials emphasize the qualities of a product according to a common set of values. The advertiser describes the product in a familiar environment so that the audience naturally perceives it as useful. Camera takes are usually frontal, and transitions take place in a smooth and natural way. This implies choosing long dissolves for merging shots, and a prevalence of horizontal and vertical lines, giving an impression of relaxation and solidity.
❚ Critical commercials introduce a hierarchy of reference values. The product serves as the subject of the story, which focuses on the product's qualities through an apparently objective description of its features. The scene has to appear more realistic than reality itself. For this reason, the commercial has a minimum number of camera breaks. Due to smooth camera motions, the ever-changing colors in the background draw the audience's attention to the (constant) color of the product.
❚ Utopic commercials provide evidence that the product can succeed in critical tests. Here, the story doesn't follow a realistic plot. Rather, situations appear as in a dream. Mythical scenarios are chosen to present the product, shown to succeed in critical conditions, often in a fantastic and unrealistic way. The director creates a movie-like atmosphere, with a set of dominant colors defining a closed chromatic world and with all the traditional editing effects (cuts, dissolves) possibly taking place.
❚ Playful commercials emphasize the accordance between users' needs and product qualities. Here a manifest parody of the other typologies of ads takes place. The commercial clearly states to the audience that they're watching advertising material. Situations and places visibly differ from everyday life, and they're deformed in such a caricatural and grotesque fashion that the agreement between product qualities and purchaser's needs is often shown in an ironic way (such as an old woman driving a Ferrari). The director emphasizes the presence of the camera in the ad and uses all possible effects to stimulate the active participation of the audience. Also, everything looks strange and "false": unnatural colors, improbable camera takes, and so on.
Emotion in commercials. The choice of a given combination of narrative signs affects both the expressive and emotional levels of signification. While dissolves communicate calmness and relaxation, cuts increase the video's dynamism (see again Figure 2). Sometimes in commercials, directors emphasize cuts by including a white frame between the end of one shot and the beginning of the next, thus inducing a sense of shock in the observer. The semantics associated with editing effects thus relates to the use of cuts or dissolves and to the length of the shots in the video. If the video consists of a few long shots, the effect is one of calmness and relaxation. Alternatively, videos with many short shots separated by cuts induce a sense of action and dynamism in the audience.
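As a rough illustration of this editing-rhythm rule, one might score a video's dynamism from its shot lengths and the fraction of cut transitions. This is our sketch only; the reference shot length and the squashing function are assumptions, not parameters from the article:

```python
# Illustrative sketch: many short, cut-separated shots push the score
# toward 1 (action/dynamism); a few long, dissolve-joined shots push it
# toward 0 (calmness). Thresholds below are invented for the example.

def dynamism_score(shot_lengths_s, cut_fraction):
    """shot_lengths_s: shot durations in seconds;
    cut_fraction: fraction of transitions that are cuts (vs dissolves)."""
    if not shot_lengths_s:
        return 0.0
    avg = sum(shot_lengths_s) / len(shot_lengths_s)
    # Short average shots yield a rhythm score near 1; long ones near 0.
    rhythm = 1.0 / (1.0 + avg / 2.0)  # 2 s reference length (assumed)
    return rhythm * cut_fraction

fast = dynamism_score([0.5, 0.8, 0.6, 0.4], cut_fraction=1.0)  # spot ad
slow = dynamism_score([12.0, 15.0], cut_fraction=0.0)          # two dissolves
```

With these made-up inputs, the rapid-fire, cut-heavy video scores high and the two-shot, dissolve-joined video scores zero, matching the qualitative rule in the text.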
Content representation and semantics-based retrieval
In the framework of our research, we used the theories presented in the previous section to qualify the content of a visual message at an intermediate semantic level. Specifically, we've exploited Itten's theory and semiotic principles to represent the visual content at the expressive level. Our work supports defining formal rules to qualify the effects of color, geometric, and dynamic feature combinations. We've also exploited psychological principles of visual communication to represent the visual content at the emotional level. In the next section, we use the semantic descriptors automatically extracted from art images and commercial videos for retrieval purposes.
Content representation for art images
Here we discuss the rules needed to represent the content of art images.
Expressive level. Exploiting Itten's model to qualify an image's chromatic content at an expressive level requires
1. segmenting the image into regions characterized by uniform colors, and

2. representing chromatic and spatial features of image regions.
We segment images by looking for clusters in the color space, then back-projecting cluster centroids in the feature space onto the image. Segmentation occurs through selecting the appropriate color space so that small feature distances correspond to similar colors in the perceptual domain. Adopting the International Commission on Illumination (Commission Internationale de l'Eclairage, CIE) L*u*v* space12 accomplishes this.
Once segmented into regions, the image can be described in terms of intra-region and inter-region properties. Intra-region properties include region color, warmth, hue, luminance, saturation, position, and size. Inter-region properties consist of hue, saturation, warmth, and luminance contrast, and harmony. To manage the vagueness of chromatic properties, we use a fuzzy representation model to describe a generic property's value. We describe a generic property by introducing n reference values for that property and then considering n scores (ϕ1, …, ϕn), where the i-th score measures the extent to which the region conforms to the i-th reference value. For instance, if we introduce three reference values for the luminance of a region (corresponding to dark, medium, and bright), the descriptor (0.0, 0.1, 0.9) represents a very bright region's luminance.
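The luminance example can be reproduced with a small sketch. The triangular membership functions below are our assumption; the article does not specify the membership shape:

```python
# Fuzzy descriptor sketch: score a region property against n reference
# values. Here, luminance in [0, 1] against dark / medium / bright.

def triangular(x, left, peak, right):
    """Triangular membership: 1 at peak, 0 at or outside [left, right]."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

def luminance_descriptor(lum):
    """Scores (dark, medium, bright) for a region luminance in [0, 1]."""
    dark = triangular(lum, -0.5, 0.0, 0.5)
    medium = triangular(lum, 0.0, 0.5, 1.0)
    bright = triangular(lum, 0.5, 1.0, 1.5)
    return (dark, medium, bright)

d = luminance_descriptor(0.95)  # approximately (0.0, 0.1, 0.9)
```

With luminance 0.95 this yields roughly the (0.0, 0.1, 0.9) "very bright" descriptor used as the example in the text.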
According to Itten's rules, these abstract properties must be translated into language sentences. In formal language theory notation, these sentences are represented through region formulas (φ) that characterize the chromatic and arrangement properties of color patches:
φ := region | hue = λh | lum = λl | sat = λs | warmth = λw | size = λS | position = λp | Contrastγ(φ1, φ2) | Harmony(φ1, …, φn) | φ1 ∧ φ2 | φ1 ∨ φ2
where λγ is a feasible value for the measure γ, with γ ∈ {h, l, s, w} denoting the attributes of hue, luminance, saturation, and warmth, respectively. We define semantic clauses in terms of a fulfillment relation |= of a generic formula φ on a region R. The degree of truth by which φ is verified over R is expressed by a value ξ computed from the fuzzy descriptors of region properties. We code semantic clauses into a model-checking engine. Given a generic formula φ and an image I, the engine computes a score representing the degree of truth by which φ is verified over I (see the sidebar "Model-Checking Engine").13
Emotional level. The psychological analysis of effects induced by images suggests that we can use only a limited subset of primary emotions to express the great variability of human emotions (secondary emotions). Explicitly defining the rules mapping expressive and perceptual features onto emotions would require us to build an overcomplicated model taking into account cultural and behavioral issues. Hence, we prefer to determine the relative relevance of each single feature through an adaptation process that fits a specific culture and fashion.
Explicitly, the definition of the rules mapping expressive and perceptual features onto emotions involves
1. Identifying a set of primary emotions. We identified four primary emotions, namely action, relaxation, joy, and uneasiness, as the most relevant to express the interaction of humans with images. Referring to the general scheme presented above, secondary emotions such as fear and aggressiveness can be expressed as combinations of action and uneasiness.
2. Identifying for each primary emotion a set of plausible inputs. In our model, we expect contrasts of warmth and contrasts of hue to generate a sense of action and dynamism. The presence of lines with a high slope reinforces this.
3. Defining a training set for the model. Given a database of images, we use some images as templates of primary emotions. This training set adapts the weights of perceptual features to determine the relative relevance of each feature for the emotion induced in subjects of a certain cultural context.
We summarize the dependence between perceptual/expressive features and emotions below. We measure the degree of action communicated by an image as

GIaction = gI1(Fwarmth−c, Fhue−c, ϕslanted, w1a, w2a, w3a)

where Fwarmth−c and Fhue−c represent the expressive features measuring the presence of warmth and hue contrasts, and ϕslanted the perceptual feature measuring the presence of slanted lines (the wi's are the feature weights). We can measure the degree of relaxation an image communicates by

GIrelax = gI2(Flum−c, ϕbrown, ϕgreen, w1r, w2r, w3r)

where Flum−c represents the expressive feature measuring the presence of luminance contrasts, and ϕbrown and ϕgreen the perceptual features measuring the presence of brown and green regions. The degree of joy communicated by an image can be measured as

GIjoy = gI3(Fharmony, w1j)

where Fharmony represents the expressive feature measuring the presence of regions in harmonic accordance. Finally, we measure the degree of uneasiness an image communicates as
Model-Checking Engine
A model-checking engine decides the satisfaction of a formula φ over an image I with a two-step process. First, the engine recursively decomposes formula φ into subformulas in a top-down manner. This allows representing φ with a tree in which leaf nodes correspond to intra-region specifications. Afterwards, the engine labels regions in the image description with the subformulas they satisfy, in a bottom-up manner. The engine decides the satisfaction of region formulas by directly referring to the chromatic fuzzy descriptors. This first labeling level is then exploited to decide the satisfaction of composition formulas. In this labeling process, the engine first checks whether a region contains pixels that can satisfy the subformula. If it does, the engine labels the region with the subformula and with the degree of truth by which the subformula is satisfied. When the degree of truth falls under a given minimum threshold, the engine drops the image from the candidate list.
In Figure A, the engine computes the degree by which the formula φ = Contrastw(hue = orange, size = large) holds over an image that has already been segmented. For simplicity, the engine considers only four regions, namely R1, R2, R3, and R4. First, φ decomposes into subformulas:

φ1 : hue = orange
φ2 : size = large
φ3 : Contrastw(φ1, φ2)

Then, according to a bottom-up approach, the engine labels regions with the subformulas they satisfy:
Step 1 Label R2 and R4 with φ1
Step 2 Label R1 with φ2
Step 3 Label R1, R2, and R4 with φ3
At the end of this process, the engine has labeled three regions with φ3; thus, the original formula φ is satisfied over the image.
Figure A. The model-checking engine. From top to bottom: decomposition of the formula φ = Contrastw(hue = orange, size = large) into subformulas φ1, φ2, and φ3; the segmented image with regions R1 through R4; and the original image.
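The sidebar's example can be mimicked with a toy bottom-up labeler. The region scores and the warmth-contrast combination rule below are made-up illustrations, not the authors' actual engine:

```python
# Fuzzy degrees to which each region satisfies the leaf formulas
# (invented values chosen to reproduce the sidebar's labeling steps).
regions = {
    "R1": {"hue=orange": 0.1, "size=large": 0.9},
    "R2": {"hue=orange": 0.8, "size=large": 0.2},
    "R3": {"hue=orange": 0.0, "size=large": 0.3},
    "R4": {"hue=orange": 0.7, "size=large": 0.1},
}

THRESHOLD = 0.5  # minimum degree of truth to keep a label

def label_regions(leaf):
    """Step 1: label regions satisfying a leaf formula above threshold."""
    return {r: s[leaf] for r, s in regions.items() if s[leaf] >= THRESHOLD}

def contrast(phi1_labels, phi2_labels):
    """Step 2: label every region that takes part in a contrasting pair
    (one region per subformula); the pair's degree is the min of the two."""
    out = {}
    for r1, d1 in phi1_labels.items():
        for r2, d2 in phi2_labels.items():
            if r1 != r2:
                for r in (r1, r2):
                    out[r] = max(out.get(r, 0.0), min(d1, d2))
    return out

phi1 = label_regions("hue=orange")  # labels R2 and R4
phi2 = label_regions("size=large")  # labels R1
phi3 = contrast(phi1, phi2)         # labels R1, R2, and R4
satisfied = max(phi3.values()) >= THRESHOLD
```

With these scores the labeler reproduces the sidebar's three steps: φ1 on R2 and R4, φ2 on R1, and φ3 on R1, R2, and R4, so φ is satisfied over the image.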
GIuneas = gI4(Fn−hue−c, ϕyellow, ϕpurple, w1u, w2u, w3u)

where Fn−hue−c represents the expressive feature measuring the absence of contrasts of hue, and ϕyellow and ϕpurple the perceptual features measuring the presence of yellow and purple regions.

Presently we use a linear mapping to model the gIi functions and a linear regression scheme to achieve adaptation of the weights wi based on a set of training examples.
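The linear mapping plus least-squares weight adaptation can be sketched as follows. This is a toy with two features and invented training data; the actual system fits the gIi functions over the full perceptual and expressive feature set:

```python
# Sketch: an emotion score g(x) is a linear map w·x over feature scores,
# with weights fit by least squares against training images labeled with
# the target emotion degree (hand-rolled 2x2 normal equations).

def fit_weights(X, y):
    """Least squares for two features: solve (X^T X) w = X^T y."""
    a = sum(x[0] * x[0] for x in X)
    b = sum(x[0] * x[1] for x in X)
    d = sum(x[1] * x[1] for x in X)
    p = sum(x[0] * t for x, t in zip(X, y))
    q = sum(x[1] * t for x, t in zip(X, y))
    det = a * d - b * b
    return ((d * p - b * q) / det, (a * q - b * p) / det)

def g(weights, features):
    """Linear emotion score, clipped to [0, 1]."""
    s = sum(w * f for w, f in zip(weights, features))
    return max(0.0, min(1.0, s))

# Invented training templates:
# (warmth contrast, slanted lines) -> degree of action.
X = [(0.9, 0.8), (0.1, 0.2), (0.8, 0.1), (0.2, 0.9)]
y = [0.9, 0.1, 0.5, 0.6]
w = fit_weights(X, y)
score = g(w, (0.9, 0.8))  # a warm, slanted image scores high on action
```

In a production system one would use a standard regression routine instead of the hand-rolled solver; the point is only that the weights, and hence the emotion scores, adapt to the training templates.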
Content representation for commercial videos
Now we'll present the content representation rules for commercial videos.
Expressive level. Semiotic principles permit establishing a link between the set of perceptual features and each of the four semiotic categories of commercial videos (practical, playful, utopic, and critical). This supports organizing a database of commercials on the basis of their semiotic content. Once a video has been segmented into shots, all perceptual features are extracted. Each feature's value is represented according to a fuzzy representation model. A score in [0, 1] is introduced for each feature to qualify the feature's presence in the video. Inter-frame features address
❚ the presence of cuts (ϕcuts) and dissolves (ϕdissolves),

❚ the presence or absence of colors recurring in many frames (ϕrecurrent ~1 and ϕrecurrent ~0, respectively), and

❚ the presence or absence of editing effects (ϕedit ~1 and ϕedit ~0, respectively).

Intra-frame features address the presence of

❚ horizontal and vertical or slanted lines (ϕhor/vert ~1 and ϕhor/vert ~0, respectively) and

❚ saturated or unsaturated colors (ϕsaturated ~1 and ϕsaturated ~0, respectively).
Table 1 summarizes how perceptual features are mapped onto the four semiotic categories. We adopted a linear mapping in which table entries indicate expected feature weights and "×" indicates irrelevance.
We represent the video content through sentences qualifying the presence of practical, playful, utopic, and critical features. These sentences are expressed through formulas (Φ) defined as

Φ := Fpractical ≥ k1 | Fplayful ≥ k2 | Futopic ≥ k3 | Fcritical ≥ k4 | Φ1 ∧ Φ2 | Φ1 ∨ Φ2

where the ki's represent fixed threshold values.
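A sketch of this classification step follows. The agreement measure and equal feature weighting are our assumptions for illustration; the article fits actual weights:

```python
# Each semiotic category score F is a combination of fuzzy perceptual
# scores, following the expectations in Table 1 (~1 rewards presence,
# ~0 rewards absence, irrelevant features are omitted).

EXPECTED = {  # feature -> expected value per category (Table 1, sketched)
    "practical": {"saturated": 0.0, "hor_vert": 1.0, "dissolves": 1.0},
    "playful":   {"saturated": 1.0, "hor_vert": 0.0, "cuts": 1.0},
    "utopic":    {"recurrent": 1.0, "cuts": 1.0, "dissolves": 1.0},
    "critical":  {"recurrent": 0.0, "hor_vert": 1.0, "edit": 0.0},
}

def category_score(phi, expected):
    """Average agreement between observed scores and expected values."""
    terms = [1.0 - abs(phi[f] - v) for f, v in expected.items()]
    return sum(terms) / len(terms)

def classify(phi, threshold=0.7):
    """Keep the categories whose formula F >= k holds (fixed threshold k)."""
    scores = {c: category_score(phi, e) for c, e in EXPECTED.items()}
    return {c: s for c, s in scores.items() if s >= threshold}

# A hypothetical "practical" spot: unsaturated, frontal, long dissolves.
phi = {"saturated": 0.1, "recurrent": 0.5, "hor_vert": 0.9,
       "cuts": 0.2, "dissolves": 0.8, "edit": 0.6}
labels = classify(phi)
```

For this invented feature vector only the practical formula passes its threshold, which is the kind of sentence-level qualification the Φ formulas express.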
Emotional level. High-level properties of a commercial relate to the feelings it inspires in the observer. We've organized these characteristics in a hierarchical fashion. A first classification separates commercials that induce action from those that induce quietness. Each class then further splits to specify the kind of action and quietness. Explicitly, action in a commercial can induce two different feelings, suspense and excitement. Similarly, quietness can be further specified as relaxation and happiness.
In the following, we describe the mapping between emotional features and low-level perceptual features, reflecting reference principles of visual communication (see also Table 2).
❚ Action: The degree of action GVaction for a video can be improved by the presence of red and purple. Action videos are often characterized by framings exhibiting high slanting. Short sequences are joined by cuts. The main distinguishing feature of these videos lies in the presence of a high degree of motion in the scene.

❚ Excitement: Videos that communicate excitement (with degree GVexcite) typically feature short sequences joined through cuts.

❚ Suspense: To improve the degree of suspense GVsuspense in a video that communicates action, directors combine both long (ϕlong) and short (ϕshort) sequences in the video and join them through frequent cuts.
Table 1. Perceptual mapping onto semiotic categories.

Perceptual Features   Fpractical   Fplayful   Futopic   Fcritical
ϕsaturated            ~0           ~1         ×         ×
ϕrecurrent            ×            ×          ~1        ~0
ϕhor/vert             ~1           ~0         ×         ~1
ϕcuts                 ×            ~1         ~1        ×
ϕdissolves            ~1           ×          ~1        ×
ϕedit                 ×            ×          ×         ~0
❚ Quietness: The degree of quietness GVquiet for a video can be improved by the presence of blue, orange, green, and white colors, and lowered by the presence of black and purple. Quiet videos feature horizontal framings. A few long sequences might be present, possibly joined through dissolves.
❚ Relaxation: Videos that communicate relaxation (with degree GVrelax) don't show relevant motion components.
❚ Happiness: Videos that communicate happiness (with degree GVhappy) share the same features as quiet videos but also exhibit a relevant motion component.
Presently, we use a linear approximation to model emotional feature mapping, constructed by weight adaptation according to a linear regression scheme. Following the above scheme, the extent to which a video k conforms to one of the six classes is computed through six scores {GVj(k)}, j = 1, …, 6, each representing a weighted combination of low-level video features. To achieve good discrimination between the six classes of commercials requires that for a generic video belonging to category j, the ratio GVj/GVm must be high, at least for all categories m that are not a generalization of category j (we call these categories j-opponent categories).
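The weighted combination can be sketched as follows. The feature subset and weight values here are illustrative stand-ins chosen to follow the sign pattern of Table 2 (~1 as positive, ~0 as negative, × as zero); they are not the coefficients actually fitted by our regression scheme.

```python
import numpy as np

# Illustrative weights mapping low-level features to emotion scores.
# Real weights would come from linear regression on labeled videos.
FEATURES = ["cuts", "motion", "hor_vert", "red", "purple", "blue"]
WEIGHTS = {
    "action":    np.array([0.3, 0.4, -0.2, 0.3, 0.2, 0.0]),
    "quietness": np.array([0.1, -0.3, 0.3, 0.0, -0.2, 0.3]),
}

def emotion_scores(phi):
    """phi: dict of low-level features in [0, 1] -> dict of GV_j scores."""
    x = np.array([phi[f] for f in FEATURES])
    return {j: float(w @ x) for j, w in WEIGHTS.items()}

# A fast, red-dominated video with slanted framings.
phi = {"cuts": 0.9, "motion": 0.8, "hor_vert": 0.1, "red": 0.7,
       "purple": 0.5, "blue": 0.1}
scores = emotion_scores(phi)
```

With these toy weights the example video scores much higher on action than on quietness, which is the kind of separation the GVj/GVm ratio condition demands.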
Model validation and retrieval results

At the Visual Information Processing Lab of the University of Florence, we've used the compositional semantics framework expounded earlier to develop systems for retrieving still images and videos based on both expressive and emotional content.13,14 We discuss the method used in these systems for extracting perceptual features from raw visual data in the sidebar "Perceptual Features for Still Images and Videos."
In this section, we present examples of retrieving art images and commercial videos according to the expressive level. You can find retrieval based on emotional features elsewhere.15 We evaluated retrieval performance in terms of effectiveness, that is, a measure of the agreement between human evaluators and the system in ranking a test set of images according to their similarity to a query. Measuring effectiveness reliably requires a small image test set (typically 20 to 50 images). Given a sample query, we evaluated the agreement as the percentage of human evaluators who rank images in the same (or very close) position as the system does. We define effectiveness as
Sj(i) = Σk Qj(i, k),  with k ranging from Pj(i) − σj(i) to Pj(i) + σj(i)

where i is an image from the test set, j denotes the sample query, Pj(i) is the rank assigned to image i by the system, and σj(i) is the width of the window centered in that rank. Qj(i, k) is the percentage of people who ranked the i-th image in position k, so Sj(i) sums the evaluators who placed the image between Pj(i) − σj(i) and Pj(i) + σj(i).
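This effectiveness measure can be sketched directly; the evaluator ranks below are invented for illustration.

```python
import numpy as np

def effectiveness(ranks, system_rank, sigma):
    """S_j(i): fraction of evaluators who placed image i within
    +/- sigma of the system's rank P_j(i).

    ranks: rank assigned to image i by each human evaluator.
    """
    ranks = np.asarray(ranks)
    lo, hi = system_rank - sigma, system_rank + sigma
    return float(np.mean((ranks >= lo) & (ranks <= hi)))

# 35 evaluators; the system ranked the image 3rd; window width 1.
ranks = [3] * 20 + [2] * 8 + [4] * 4 + [7] * 3
s = effectiveness(ranks, system_rank=3, sigma=1)   # 32 of 35 fall in [2, 4]
```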
Table 2. Perceptual mapping onto emotional features.

                      Emotion Categories
Perceptual Features   Action   Excitement   Suspense   Quietness   Relaxation   Happiness
ϕdissolves            ×        ×            ×          ~1          ~1           ~1
ϕcuts                 ~1       ~1           ~1         ~1          ~1           ~1
ϕlong                 ×        ×            ~1         ~1          ~1           ~1
ϕshort                ×        ~1           ~1         ~1          ~1           ~1
ϕmotion               ~1       ~1           ~1         ×           ~0           ~1
ϕhor/vert             ~0       ~0           ~0         ~1          ~1           ~1
ϕred                  ~1       ~1           ~1         ×           ×            ×
ϕorange               ×        ×            ×          ~1          ~1           ~1
ϕgreen                ×        ×            ×          ~1          ~1           ~1
ϕblue                 ×        ×            ×          ~1          ~1           ~1
ϕpurple               ~1       ~1           ~1         ~0          ~0           ~0
ϕwhite                ×        ×            ×          ~1          ~1           ~1
ϕblack                ×        ×            ×          ~0          ~0           ~0

"×" indicates irrelevance.
Perceptual Features for Still Images and Videos

Here we discuss the perceptual features for still images and videos.
Still images

An image's semantics relates to its color content and the presence of elements such as lines that induce dynamism and action.
Colors

Color cluster analysis helps segment images into color regions. We can obtain clustering in 3D color space by using an improved version of the standard K-means algorithm, which avoids converging to nonoptimal solutions. This algorithm uses competitive learning as the basic technique for grouping points in the color space. An image's chromatic content can then be expressed using a set of eight numbers normalized in [0, 1], one for each color in the set {red, orange, yellow, green, blue, purple, white, black}. The i-th number quantifies the presence in the image of a region exhibiting the i-th color.
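The resulting eight-number chromatic vector can be sketched as follows. For brevity this replaces the competitive-learning K-means step with fixed color prototypes; the prototype RGB values are illustrative assumptions.

```python
import numpy as np

# Illustrative RGB prototypes for the eight color names.
PROTOTYPES = {
    "red": (255, 0, 0), "orange": (255, 165, 0), "yellow": (255, 255, 0),
    "green": (0, 128, 0), "blue": (0, 0, 255), "purple": (128, 0, 128),
    "white": (255, 255, 255), "black": (0, 0, 0),
}

def chromatic_content(pixels):
    """Map each pixel to its nearest prototype and return the eight
    normalized presence values in [0, 1] (fractions of the image)."""
    pts = np.asarray(pixels, dtype=float).reshape(-1, 3)
    protos = np.asarray(list(PROTOTYPES.values()), dtype=float)
    # Squared Euclidean distance of every pixel to every prototype.
    d = ((pts[:, None, :] - protos[None, :, :]) ** 2).sum(axis=2)
    counts = np.bincount(d.argmin(axis=1), minlength=len(protos))
    return dict(zip(PROTOTYPES, counts / len(pts)))

img = [(250, 5, 5)] * 3 + [(5, 5, 250)]   # 3 reddish pixels, 1 bluish pixel
vec = chromatic_content(img)              # vec["red"] = 0.75, vec["blue"] = 0.25
```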
Lines

Detecting significant line slopes in an image can be accomplished by using the Hough transform to generate a line slope histogram (see Figure B). The feature ϕhor/vert ∈ [0, 1] gives the ratio of horizontal and vertical lines with respect to the overall number of lines in the image.
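Assuming line slopes have already been extracted (for example, from the peaks of the Hough slope histogram), ϕhor/vert can be sketched as below; the angular tolerance is an assumption.

```python
def phi_hor_vert(slopes_deg, tol=10.0):
    """Fraction of detected lines that are horizontal or vertical.

    slopes_deg: line slopes in degrees in [0, 180], e.g. taken from
    a Hough-transform line slope histogram.  tol is the angular
    tolerance for calling a line horizontal or vertical.
    """
    if not slopes_deg:
        return 0.0
    def is_hv(a):
        # Distance to the nearest of 0, 90, or 180 degrees.
        return min(a, abs(a - 90), abs(a - 180)) <= tol
    return sum(is_hv(a) for a in slopes_deg) / len(slopes_deg)

# Three near-horizontal/vertical lines out of four detected lines.
phi = phi_hor_vert([2.0, 88.0, 178.0, 45.0])   # 0.75
```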
Videos

Video analysis primarily aims to segment video, that is, to identify each shot's start and end points and to characterize the video through its most representative keyframes. Once a video has been fragmented into shots and video editing features have been extracted, each shot's content can be internally described by segmenting each shot's keyframe as described in the previous section of this sidebar.
Cuts

Rapid motion in the scene and sudden changes in lighting yield low correlation between contiguous frames, especially when a high temporal subsampling rate is adopted. To avoid false cut detection, we studied a metric insensitive to such variations while reliably detecting true cuts. We partition each frame into nine subframes and represent each subframe by its color histograms in the hue, saturation, intensity (HSI) color space. We detect cuts by considering the volume of the difference of subframe histograms in two consecutive frames. The presence of a cut at frame i can be detected by thresholding the average volume value over the nine subframes. After repeating the above procedure for all frames i = 1 … #frames, we simply obtain the overall feature related to the presence of cuts in a video as ϕcuts = #cuts/#frames, where ϕcuts ∈ [0, 1].
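A minimal sketch of the subframe histogram test; the bin count and threshold values are illustrative, not the tuned values used in our system.

```python
import numpy as np

def subframe_hists(frame, bins=8):
    """Split an HxWx3 frame (channels in [0, 1], e.g. HSI) into a 3x3
    grid and return one histogram per channel per subframe."""
    h, w, _ = frame.shape
    hists = []
    for i in range(3):
        for j in range(3):
            sub = frame[i * h // 3:(i + 1) * h // 3,
                        j * w // 3:(j + 1) * w // 3]
            hists.append([np.histogram(sub[..., c], bins=bins,
                                       range=(0.0, 1.0))[0] for c in range(3)])
    return np.asarray(hists, dtype=float)   # shape (9, 3, bins)

def is_cut(prev, curr, thresh):
    """Flag a cut when the average absolute histogram difference over
    the nine subframes exceeds the threshold."""
    diff = np.abs(subframe_hists(prev) - subframe_hists(curr))
    return bool(diff.sum(axis=(1, 2)).mean() > thresh)

dark = np.zeros((9, 9, 3))
light = np.full((9, 9, 3), 0.9)
cut = is_cut(dark, light, 10.0)   # abrupt content change -> cut
```

ϕcuts then follows as the count of flagged frames divided by #frames.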
Dissolves

The dissolve effect merges two sequences by partly overlapping them. Detecting dissolves in commercials proves particularly difficult because dissolves typically occur in a limited number of consecutive frames. Due to this peculiarity, existing approaches to detect dissolves (developed for movies) have shown poor performance. We instead use corner statistics to detect dissolves. During the editing effect, corners associated with the first sequence gradually disappear and those associated with the second sequence gradually appear.
Figure B. Perceptual features for still images. From left to right: video frame, computed edges, and line slope histogram.
Image retrieval

For art image retrieval, we used a test set of 40 images ranging from Renaissance to contemporary painters. We measured retrieval effectiveness with reference to four different queries, addressing contrasts of luminance, warmth, and saturation, and harmonic accordance.
We collected answers from 35 experts in the fine arts field. We asked them to rank database images according to each reference query. Figure 3 shows the test database images. Figure 4 shows the eight top-ranked images the system retrieved in response to each of the four reference queries. (For more information about these images, visit http://www.cineca.it/wm/.)
Figure 5 shows the plots of effectiveness S as a function of rankings. They show the agreement between the people interviewed and the system in ranking images according to the queries. Figure 5 shows only rankings from 1 to 8, since they represent agreement on the most representative images. The plots show a very large agreement
This yields a local minimum in the number of corners detected during the dissolve (see Figure C). An image corner is characterized by large and distinct values of the eigenvalues of the gradient autocorrelation matrix. We evaluate the feature ϕdissolves ∈ [0, 1] as ϕdissolves = #dissolves/#frames.
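The corner-count minimum can be sketched as follows, assuming per-frame corner counts are already available from a corner detector; the drop ratio and neighborhood width are illustrative knobs.

```python
import numpy as np

def dissolve_minima(corner_counts, drop=0.5, width=3):
    """Return frame indices where the corner count dips to a local
    minimum well below its neighborhood average, the signature of a
    dissolve: corners of the outgoing shot vanish before corners of
    the incoming shot appear."""
    c = np.asarray(corner_counts, dtype=float)
    found = []
    for i in range(width, len(c) - width):
        hood = np.r_[c[i - width:i], c[i + 1:i + 1 + width]]
        is_local_min = c[i] == c[i - width:i + width + 1].min()
        if is_local_min and c[i] < drop * hood.mean():
            found.append(i)
    return found

counts = [40, 42, 41, 12, 39, 43, 40, 41]        # sharp dip at frame 3
idx = dissolve_minima(counts)                    # [3]
phi_dissolves = len(idx) / len(counts)           # 1/8 = 0.125
```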
Motion

We analyze motion by tracking corners in a sequence. For each shot, we compute a feature ϕmotion ∈ [0, 1] that represents the average intensity of motion. ϕmotion = 0 means that motion is absent during the sequence. Higher values of ϕmotion indicate the presence of increasingly relevant motion components.
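Assuming corner trajectories are available from the tracker, ϕmotion can be sketched as a normalized average displacement; the normalization constant is an assumption, not part of the original formulation.

```python
import numpy as np

def motion_intensity(tracks, max_disp):
    """Average per-frame displacement of tracked corners, normalized
    into [0, 1] by max_disp, the largest displacement we care to
    distinguish.

    tracks: array of shape (n_corners, n_frames, 2) with (x, y)
    corner positions across the shot.
    """
    t = np.asarray(tracks, dtype=float)
    # Per-corner, per-step displacement magnitudes.
    disp = np.linalg.norm(np.diff(t, axis=1), axis=2)
    return float(np.clip(disp.mean() / max_disp, 0.0, 1.0))

# Two corners over three frames: one static, one moving 5 pixels/frame.
tracks = [[(0, 0), (0, 0), (0, 0)],
          [(0, 0), (3, 4), (6, 8)]]
phi_motion = motion_intensity(tracks, max_disp=10.0)   # (0+0+5+5)/4 / 10 = 0.25
```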
Inter-shot features

We represent color-related inter-shot features used in a video as ϕrecurrent ∈ [0, 1] (expressing the ratio between colors that recur in a high percentage of keyframes and the overall number of significant colors in the video sequence) and ϕsaturated (expressing the relative presence of saturated colors in the scene). Another relevant inter-shot characteristic, the rhythm of a sequence of shots, relates to shot duration and to the use of cuts and dissolves to join shots. We define the rhythm r(i1, i2) ∈ [0, 1] of a video sequence over a frame interval [i1, i2] as
r(i1, i2) = (#cuts + #dissolves) / (i2 − i1 + 1)

where #cuts and #dissolves are measured in the same interval. A simple feature measuring an entire sequence's internal rhythm is the average rhythm, which relates to the overall number of breaks: ϕedit = r(1, #frames).
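The rhythm feature is straightforward to compute once cuts and dissolves have been counted:

```python
def rhythm(n_cuts, n_dissolves, i1, i2):
    """r(i1, i2) = (#cuts + #dissolves) / (i2 - i1 + 1): the density
    of editing breaks in the frame interval [i1, i2]."""
    return (n_cuts + n_dissolves) / (i2 - i1 + 1)

# Average rhythm of a whole 200-frame video with 7 cuts and 3 dissolves.
phi_edit = rhythm(7, 3, 1, 200)   # 10/200 = 0.05
```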
Figure C. Corners detected during a dissolve effect.
Figure 3. Images used to test system effectiveness for the representation of expressive content. (Courtesy of WebMuseum.)
between the experts and the system in assigning similarity rankings.
Figure 4. Top-ranked images according to queries for contrast of luminance (a), contrast of saturation (b), contrast of warmth (c), and harmonic accordance (d).

Figure 5. System effectiveness measured against experts for four different sample queries: contrast of luminance (a), contrast of saturation (b), contrast of warmth (c), and harmonic accordance (d).

Figure 6 shows an example of retrieval according to the expressive level using a database of 477 images representing paintings from the 15th to the 20th century. Two dialog boxes define properties (hue and dimension) of the two sketched regions of Figure 6. The user may define the degree of truth (60 percent) assigned to the query. Figure 6 (right) shows the retrieved paintings. The 12 best-matched images all feature large regions with contrasting luminance. The top-ranked image represents an outstanding example of luminance contrast, featuring a black region over a white background. Images in the second, third, fifth, sixth, and seventh positions exemplify how contrast of luminance between large regions can be used to convey the perception of different planes of depth.

Figure 6. Results of a query for images with two large regions showing contrasting luminance. (Courtesy of WebMuseum.)
Video retrieval

We evaluated video retrieval effectiveness using a test set of 20 commercial videos. A team of five experts in the semiotic and marketing fields classified each video in terms of the four semiotic categories. We used the common representation of commercials categories based on the semiotic square. This instrument, first introduced by Greimas,10 combines pairs of four semiotic objects with the same semantic level, according to three basic relationships: opposition, completion, and contradiction. Objects placed at opposite sides of the square are complementary. Figure 7a shows the practical-playful and critical-utopic diagonals of the semiotic square as coordinate axes. We asked the experts to classify the commercials by associating each commercial with a position on the square. To ease the classification task, 25 rectangular regions were identified in the square through a regular partition, which supports the definition of a video's three different degrees of conformity to a generic category. For instance, all videos classified in region 4 of Figure 7a feature a high degree of conformity to the utopic category and a medium conformity to the playful one.
Figure 7 shows how the experts (Figure 7b) and the system (Figure 7c) classified database commercials. Each region in the square has a vertical cylinder whose height is proportional to the percentage of commercials located in that region (cylinders of zero height aren't shown).
Figure 8 shows a more accurate measure of system effectiveness by displaying the agreement between the system and experts with reference to queries for purely practical, critical, utopic, and playful commercials. Each query considered the five top-ranked commercials. For a generic commercial, we measured the agreement between the system and experts as A = 1 − d/dmax, where d is the "city block" distance between the two blocks in the semiotic square where the system and the experts located the commercial. dmax is the maximum value of d, here 8. The best average agreement corresponds to the query for playful commercials, evidencing the effectiveness of the features used to model this category. The worst performance corresponds to queries for practical commercials. In this case, the system classified many commercials as practical, while the experts rated them as critical. Such a classification mismatch originates from the recurrent presence in critical commercials of foreground views of the promoted product. The experts can easily detect and recognize a foreground view, but the system can't detect this feature.
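The agreement measure can be sketched directly, with grid cells of the 5×5 partition given as (row, column) pairs:

```python
def agreement(system_pos, expert_pos, d_max=8):
    """A = 1 - d/d_max, with d the city-block distance between the
    5x5-grid cells (row, col) where the system and the experts
    located a commercial.  d_max = 8 is the distance between
    opposite corners of the 5x5 partition."""
    d = abs(system_pos[0] - expert_pos[0]) + abs(system_pos[1] - expert_pos[1])
    return 1.0 - d / d_max

a_same = agreement((1, 2), (1, 2))   # 1.0: system and experts agree exactly
a_far = agreement((0, 0), (4, 4))    # 0.0: opposite corners, d = 8
```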
Figure 7. Representation of database commercials' semiotic features (a) along with the experts' (b) and system's (c) classification of them.

Figure 8. Plots of the agreement between system and experts' classification of commercials with reference to queries for purely practical (a), critical (b), utopic (c), and playful (d) commercials.

Figures 9 through 12 show examples of video retrieval according to the expressive level using a database of 131 commercials. Users can specify queries by defining the degree by which the retrieved commercials have to conform to the four semiotic categories. Figure 9 shows the output of the retrieval system in response to a query for purely playful commercials. At the right of the list of retrieved items, a bar graph displays the degree by which each shot of the top-ranked video belongs to the playful category (the white, thin vertical lines represent cuts and dissolves). The top-ranked videos in this category all advertise products for "young and smart" people (sportswear, sport watches, blue jeans, and so on). These results aren't surprising, since they reflect the common marketing practice of targeting a commercial to a specific audience.
Figures 10 and 11 report some of the keyframes for the two top-ranked spots in Figure 9. The first best-ranked spot (advertising sportswear) presents all the typical features of a playful commercial, including a very fast rhythm, unorthodox camera takes, situations (like skating on a tennis court) reflecting semantic issues at a higher level than that of our computer analysis, and very saturated colors. Similar features appear in the second-ranked commercial. In Figure 11 notice the presence of quasi-identical keyframes (the close-ups of the man) typical of a nonlinear story. In this spot the camera frenetically switches between close-up views of the man and his dogs, almost never alternating details and global views like a utopic spot would do.
Figure 12 shows the output of the retrieval system in response to a query for purely critical commercials. In contrast to the previous example, the top-ranked videos obtained in response to the query all advertise typical home products. Figures 13 and 14 show some relevant frames for the two top-ranked commercials in Figure 12.

Figure 9. Retrieval of playful commercials.

Figure 10. Some of the keyframes for the first-ranked spot in Figure 9 ("Sergio Tacchini").

Figure 11. Some of the keyframes for the second-ranked spot in Figure 9 ("Audi A3").

Figure 12. Retrieval of critical commercials.

Courtesy of RAI.
Courtesy of Mediaset.
Conclusions and future work

If, as the adage says, "an image is worth a thousand words," then every designer of multimedia retrieval systems is well aware that the converse also holds true. That is, many times a word can stand for a thousand images, since it represents an equivalence class of objects, thus reflecting a higher semantic level than that of the objects themselves. We believe that future multimedia retrieval systems will have to support access to information at different semantic levels to reflect diverse application needs and user queries.
The research presented in this article attempts to introduce different levels of signification by a layered representation of visual knowledge. The most relevant insight we gained is that defining rules that capture visual meaning can be difficult, especially when dealing with general visual domains. As a good design practice, such rules should derive from specific domain characterizations, and possibly be refined and tailored to specific classes of users.
Future work will address experimenting with different application scenarios (such as TV news and movies), and complementing visual features and language sentences with textual and audio data. MM
Acknowledgments

We thank Bruno Bertelli and Laura Lombardi for useful discussions during the development of this work. We'd also like to acknowledge all those experts who provided their support in the performance tests.
References

1. T. Joseph and A. Cardenas, "PicQuery: A High-Level Query Language for Pictorial Database Management," IEEE Trans. on Software Engineering, Vol. 14, No. 5, May 1988, pp. 630-638.

2. N. Roussopoulos, C. Faloutsos, and T. Sellis, "An Efficient Pictorial Database System for Pictorial Structured Query Language (PSQL)," IEEE Trans. on Software Engineering, Vol. 14, No. 5, May 1988, pp. 639-650.

3. M. Flickner et al., "Query by Image and Video Content: The QBIC System," Computer, Vol. 28, No. 9, Sept. 1995, pp. 310-315.

4. J.R. Smith and S.F. Chang, "VisualSEEk: A Fully Automated Content-Based Image Query System," Proc. ACM Multimedia 96, ACM Press, New York, Nov. 1996.

5. D.A. White and R. Jain, "Similarity Indexing with the SS-tree," Proc. 12th Int'l Conf. on Data Engineering, IEEE Computer Society Press, Los Alamitos, Calif., Feb. 1996, pp. 516-523.

6. R.J. Brachman and H.J. Levesque, eds., Readings in Knowledge Representation, Morgan Kaufmann, Los Altos, Calif., 1985.

7. R. Arnheim, Art and Visual Perception: A Psychology of the Creative Eye, Regents of the University of California, Palo Alto, Calif., 1954.

8. J. Itten, Art of Color (Kunst der Farbe), Otto Maier Verlag, Ravensburg, Germany, 1961 (in German).

9. C.R. Haas, Advertising Practice (Pratique de la Publicité), Bordas, Paris, 1988 (in French).
Figure 13. Some of the keyframes for the first-ranked spot in Figure 12 ("Pasta Barilla"). (Courtesy of RAI.)

Figure 14. Some of the keyframes for the second-ranked spot in Figure 12 ("Cera Emulsio"). (Courtesy of Mediaset.)
10. A.J. Greimas, Structural Semantics (Sémantique Structurale), Larousse, Paris, 1966 (in French).

11. J.-M. Floch, Semiotics, Marketing, and Communication: Below the Signs, the Strategies (Sémiotique, Marketing et Communication: Sous les signes, les stratégies), University of France Press, Paris, 1990 (in French).

12. R.C. Carter and E.C. Carter, "CIELUV Color Difference Equations for Self-Luminous Displays," Color Research and Applications, Vol. 8, No. 4, 1983, pp. 252-553.

13. J.M. Corridoni, A. Del Bimbo, and P. Pala, "Sensations and Psychological Effects in Color Image Database," ACM Multimedia Systems, Vol. 7, No. 3, May 1999, pp. 175-183.

14. M. Caliani et al., "Commercial Video Retrieval by Induced Semantics," Proc. IEEE Int'l Workshop on Content-Based Access of Images and Video Databases, IEEE CS Press, Los Alamitos, Calif., Jan. 1998, pp. 72-80.

15. C. Colombo, A. Del Bimbo, and P. Pala, "Retrieval of Commercials by Video Semantics," Proc. IEEE Int'l Conf. on Computer Vision and Pattern Recognition (CVPR 98), IEEE CS Press, Los Alamitos, Calif., June 1998, pp. 572-577.
Carlo Colombo is an assistant professor in the Department of Systems and Informatics at the University of Florence, Italy. His main research activities are in the field of computer vision, with specific interests in image and video analysis, human-machine interfaces, robotics, and multimedia. He holds an MS in electronic engineering from the University of Florence, Italy (1992) and a PhD in robotics from the Sant'Anna School of University Studies and Doctoral Research, Pisa, Italy (1996). He is a member of IEEE and the International Association for Pattern Recognition (IAPR), and presently serves as secretary to the IAPR Italian Chapter.
Alberto Del Bimbo is a professor in the Department of Systems and Informatics at the University of Florence, Italy. His scientific interests cover image sequence analysis, shape and object recognition, image databases and multimedia, visual languages, and advanced man-machine interaction. He earned his MS in electronic engineering at the University of Florence, Italy in 1977. He is presently an associate editor of Pattern Recognition, Journal of Visual Languages and Computing, and IEEE Transactions on Multimedia. He is a member of IEEE and IAPR. He presently serves as the Chairman of the Italian Chapter of IAPR. He is the general chair of the IEEE Int'l Conference on Multimedia Computing and Systems (ICMCS 99).
Pietro Pala received his MS in electronic engineering at the University of Florence, Italy, in 1994. In 1998, he received his PhD in information science from the same university. Currently he is a research scientist at the Department of Systems and Informatics at the University of Florence. His current research interests include recognition, image databases, neural networks, and related applications.
Contact the authors at the Department of Systems and Informatics, University of Florence, Via Santa Marta 3, I-50139, Florence, Italy, e-mail {columbus,delbimbo,pala}@dsi.unifi.it.