Content-Based Image and Video Retrieval || Fundamentals of Content-Based Image and Video Retrieval

Chapter 2

FUNDAMENTALS OF CONTENT-BASED IMAGE AND VIDEO RETRIEVAL

This chapter contains basic information on Visual Information Retrieval (VIR) systems, particularly the ones whose primary emphasis is on searching for visual information (images or video clips)¹ based on their contents, which will henceforth be referred to as Content-Based Image and Video Retrieval (CBIVR) systems.

It introduces fundamental concepts, describes the main blocks of a CBIVR system, and discusses CBIVR systems from a user's perspective. A deeper discussion of design issues and related research work in this area is deferred to Chapter 3.

1. Basic Concepts

Visual Information Retrieval is a relatively new field of research in Computer Science and Engineering. As in conventional information retrieval, the purpose of a VIR system is to retrieve all the images that are relevant to a user query while retrieving as few non-relevant images as possible. The emphasis is on the retrieval of information as opposed to the retrieval of data. Similarly to its text-based counterpart, a VIR system must be able to interpret the contents of the documents (images) in a collection and rank them according to a degree of relevance to the user query. The interpretation process involves extracting semantic information from the documents (images) and using this information to match the user's needs [14].
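The goal stated above, retrieving all relevant images while returning as few non-relevant ones as possible, is conventionally quantified in information retrieval by recall and precision. The following is a minimal sketch of both measures for a single query. The function name and the image identifiers are hypothetical and are not taken from this chapter.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: identifiers returned by the system
    relevant:  identifiers judged relevant to the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Precision: fraction of the returned images that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: fraction of the relevant images that were returned.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 images returned, 3 of them relevant,
# out of 6 relevant images in the whole collection.
p, r = precision_recall(
    retrieved=["img01", "img02", "img03", "img07"],
    relevant=["img01", "img02", "img03", "img04", "img05", "img06"],
)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.50
```

A system that returns the whole collection reaches perfect recall at very low precision, which is why the two measures are usually reported together.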

¹ We shall use the expression 'image' to refer to visual information entities in general or static images in particular. For image sequences, expressions such as 'image sequence', 'video clip', or simply 'video' shall be used interchangeably.

O. Marques et al., Content-Based Image and Video Retrieval

© Kluwer Academic Publishers 2002

Figure 2.1. Visual Information Retrieval blends together many research disciplines.

Progress in Visual Information Retrieval has been fostered by recent research results in many fields (Figure 2.1), including (text-based) information retrieval, image processing and computer vision, visual data modeling and representation, man-machine interaction, multidimensional indexing, human visual perception, pattern recognition, and multimedia database organization, among others.

VIR systems can be classified into two main generations, according to the attributes used to search and retrieve a desired image or video file [49]:

• First-generation VIR systems: use query by text, allowing queries such as "all pictures of red Ferraris" or "all images of Van Gogh's paintings". They rely strongly on metadata, which can be represented by alphanumeric strings, keywords, or full scripts, and can be obtained by manual input, transcripts, captions, embedded text, or hyperlinked documents. Their performance depends on the quality of the metadata, which is very often incomplete, inaccurate, biased by the user's knowledge, ambiguous, or a combination of these.

• Second-generation (CB)VIR systems: support query by content, where the notion of content, for still images, includes, in increasing level of complexity: perceptual properties (e.g., color, shape, texture), semantic primitives (abstractions such as objects, roles, and scenes), and subjective attributes (such as impressions, emotions, and meaning associated with the perceptual properties). The basic premise behind second-generation VIR systems is that images and videos are first-class entities and that users should be able to query their contents as easily as they query textual documents, without necessarily using manual annotation [73]. Many second-generation systems use content-based techniques as a complement to, rather than a replacement for, text-based tools.

2. A Typical CBIVR System Architecture

Figure 2.2 shows a block diagram of a generic CBIVR system, whose main components are:

• User interface: friendly GUI that allows the user to interactively query the database, browse the results, and view the selected images or video clips.

• Query / search engine: collection of algorithms responsible for searching the database according to the parameters provided by the user.

• Digital image and video archive: repository of digitized, compressed images and video clips.

• Visual summaries: representation of image and video contents in a concise way, such as thumbnails for images or keyframes for video sequences.

• Indexes: pointers to images or video segments.

• Digitization and compression: hardware and software necessary to convert images and videos into digital compressed format.

• Cataloguing: process of extracting features from the raw images and videos and building the corresponding indexes.

Digitization and compression have become fairly straightforward tasks thanks to the wide range of hardware (cameras, scanners, frame grabbers, etc.) and software available. In many cases, images and videos are generated and stored directly in digital compressed format (typically using the standardized JPEG and MPEG compression schemes).

Figure 2.2. Block diagram of a CBIVR system. [Block labels: User; User interface (Querying, Browsing, Viewing); Query / Search Engine; Image or Video; Digitization + Compression; Cataloguing.]

For some specialized applications, however, the methods and devices currently used for these tasks are not suitable, and various alternative visual sensing mechanisms (stereo capture and analysis, 3-D scanners, etc.) and compression algorithms have been proposed to overcome these limitations.

The cataloguing stage is responsible for extracting features from the visual contents of the images and video clips. In the particular case of video, the original video segment is broken down into smaller pieces, called scenes, which are further subdivided into shots. Each meaningful video unit is indexed and a corresponding visual summary, typically a keyframe, is stored. In the case of images, the equivalent process is known as feature extraction, which consists of extracting numerical information about the image contents. In either case, the cataloguing stage is also where metadata might be added to the visual contents. Manually adding metadata to image and video files is mandatory for text-based first-generation VIR systems. CBIVR systems, however, typically rely on a minimal amount of metadata or none at all.
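As an illustration of the cataloguing stage, the sketch below extracts one simple numerical feature, a coarsely quantized color histogram, and uses the change in that histogram between consecutive frames to split a video into shots. This is one common approach and is an assumption of this example, not a method prescribed by the chapter. The function names, the threshold value, the choice of the first frame of each shot as its keyframe, and the representation of a frame as a list of (r, g, b) tuples are all hypothetical. Grouping shots into scenes is not attempted.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Normalized, coarsely quantized RGB histogram of one image.

    pixels: iterable of (r, g, b) tuples with values in 0..255
    """
    n = bins_per_channel
    hist = [0.0] * (n ** 3)
    count = 0
    for r, g, b in pixels:
        # Map each channel value to a bin number in 0..n-1.
        idx = ((r * n // 256) * n + (g * n // 256)) * n + (b * n // 256)
        hist[idx] += 1.0
        count += 1
    return [h / count for h in hist] if count else hist


def l1_distance(h1, h2):
    """Sum of absolute bin differences (0.0 for identical histograms)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def segment_shots(frames, threshold=0.5):
    """Split a frame sequence into shots at large histogram changes.

    Returns a list of (first_frame_index, last_frame_index) pairs.
    """
    if not frames:
        return []
    hists = [color_histogram(f) for f in frames]
    shots, start = [], 0
    for i in range(1, len(hists)):
        if l1_distance(hists[i - 1], hists[i]) > threshold:
            shots.append((start, i - 1))
            start = i
    shots.append((start, len(hists) - 1))
    return shots


# Toy video: three all-red frames followed by two all-blue frames.
red = [(255, 0, 0)] * 16
blue = [(0, 0, 255)] * 16
frames = [red, red, red, blue, blue]

shots = segment_shots(frames)
keyframes = [first for first, _ in shots]  # assumption: first frame summarizes the shot
print(shots, keyframes)  # [(0, 2), (3, 4)] [0, 3]
```

A fixed threshold is the simplest possible decision rule and will miss gradual transitions such as fades and dissolves; it is used here only to keep the sketch short.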

Digitization, compression, and cataloguing typically happen off-line. Once these three steps have been performed, the database will contain the image and video files themselves, possibly simplified representations of each image file or video segment, and a collection of indexes that act as pointers to the corresponding images or video segments.
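The database contents described above can be pictured as a small data structure. The sketch below is only an illustration under stated assumptions: the class names, field names, and file paths are hypothetical, and a plain integer is used as the index that points to each image or video segment.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class CatalogEntry:
    media_path: str                            # the stored image or video file
    summary_path: Optional[str]                # thumbnail or keyframe, if any
    features: List[float]                      # numerical description of the content
    segment: Optional[Tuple[int, int]] = None  # (first, last) frame for video segments


@dataclass
class Archive:
    entries: Dict[int, CatalogEntry] = field(default_factory=dict)

    def catalogue(self, entry: CatalogEntry) -> int:
        """Store an entry off-line and return the index that points to it."""
        index = len(self.entries)
        self.entries[index] = entry
        return index

    def resolve(self, index: int) -> CatalogEntry:
        """Follow an index back to the image or video segment it points to."""
        return self.entries[index]


archive = Archive()
img = archive.catalogue(
    CatalogEntry("photos/beach.jpg", "thumbs/beach.jpg", [0.7, 0.2, 0.1])
)
shot = archive.catalogue(
    CatalogEntry("videos/news.mpg", "keyframes/news_0120.jpg",
                 [0.1, 0.3, 0.6], segment=(120, 310))
)
print(archive.resolve(shot).segment)  # (120, 310)
```

Keeping the features next to the pointer means the on-line search can compare feature vectors without opening the media files themselves.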

The online interaction between a user and a CBIVR system is represented in the upper half of the diagram in Figure 2.2. The user expresses her query using a GUI. That query is translated and a search engine looks for the index that corresponds to the desired image or video. The results are sent back to the user in a way that should allow easy browsing, viewing, and possible refinement of the query based on the partial results.

3. The User's Perspective

The user interface is a crucial component of a CBIVR system. Ideally, such an interface should be simple, easy, friendly, functional, and customizable. It should provide integrated browsing, viewing, searching, and querying capabilities in a clear and intuitive way. This integration is extremely important, since it is very likely that the user will not always pick the best match found by the query/search engine. More often than not, users will want to check the first few best matches, browse through them, preview their contents, refine their query, and eventually retrieve the desired image or video segment.

Most VIR systems allow searching the visual database contents in several different ways - described below - either alone or combined:

• Interactive browsing: convenient for leisure users who may not have specific ideas about the images or video clips they are searching for. Clustering techniques can be used to organize visually similar images into groups and minimize the number of undesired images shown to the user.

• Navigation with customized categories: leisure users often find it very convenient to navigate through a subject hierarchy to get to the target subject and then browse or search that limited subset of images. The combination of navigation with customized categories followed by content-based search (within a category) has been proposed by several researchers ([30], for instance) who claim it can be the most effective mode of operation: in this case, the content-based portion of the VIR system would work on a smaller, semantically constrained, subset of images.

• Query by X [31], where 'X' can be:

  - an image example: several systems allow the user to specify an image (virtually anywhere on the Internet) as an example and search for the images that are most similar to it, presented in decreasing order of similarity score. It is probably the most classical paradigm of image search. Many techniques have been developed over the past years to measure similarity between the example image (template) and the target images, but these techniques still have disadvantages such as sensitivity to noise and imaging conditions, and the need for a suitable example image.

  - a visual sketch: some systems provide users with tools that allow drawing visual sketches of the image or video clip they have in mind. Users are also allowed to specify different weights for different features. Developers of one of these systems (VisualSEEk [156]) have observed, however, that "users are usually much less enthusiastic about this query method than others when the query interface is complex [30]."

  - specification of visual features: direct specification of visual features (e.g., color, texture, shape, and motion properties) is possible in some systems and might appeal to more technical users.

  - a keyword or complete text: some VIR systems rely on keywords entered by the user and search for visual information that has been previously annotated using that (set of) keyword(s).

  - a semantic class: where users specify (or navigate until they reach) a category in a preexisting subject hierarchy.
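Several of the search modes listed above can be combined in a few lines. The sketch below ranks a collection by similarity to an example image, lets the user weight individual features, and optionally restricts the search to one category before ranking, as in navigation followed by content-based search. It is a minimal illustration, not the method of any system cited here: the function names, the use of a weighted L1 distance, the feature names, and the toy feature values are all assumptions.

```python
def weighted_distance(query, target, weights):
    """Weighted sum of per-feature L1 distances.

    query, target: dicts mapping feature name -> list of floats
    weights:       dict mapping feature name -> non-negative weight
    """
    total = 0.0
    for name, w in weights.items():
        total += w * sum(abs(a - b) for a, b in zip(query[name], target[name]))
    return total


def query_by_example(example, collection, weights, category=None, top_k=3):
    """Return item ids ordered from most to least similar to the example."""
    # Optional navigation step: keep only one semantic category.
    candidates = [
        item for item in collection
        if category is None or item["category"] == category
    ]
    # Smallest distance first, i.e. decreasing order of similarity.
    ranked = sorted(
        candidates,
        key=lambda item: weighted_distance(example, item["features"], weights),
    )
    return [item["id"] for item in ranked[:top_k]]


# Toy collection with hypothetical two-bin color and one-value texture features.
collection = [
    {"id": "sunset1", "category": "nature",
     "features": {"color": [0.9, 0.1], "texture": [0.2]}},
    {"id": "forest1", "category": "nature",
     "features": {"color": [0.1, 0.9], "texture": [0.8]}},
    {"id": "car1", "category": "vehicles",
     "features": {"color": [0.7, 0.3], "texture": [0.4]}},
]
example = {"color": [0.85, 0.15], "texture": [0.25]}
weights = {"color": 1.0, "texture": 0.5}

print(query_by_example(example, collection, weights))
# ['sunset1', 'car1', 'forest1']
print(query_by_example(example, collection, weights, category="nature"))
# ['sunset1', 'forest1']
```

Restricting the candidates first means the distance computation runs on a smaller, semantically constrained subset, which is the benefit claimed for combining navigation with content-based search.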

We advocate that query options should be made as simple, intuitive and close to human perception of similarity as possible. Users are more likely to prefer a system that offers the "Show me more images that look similar to this" option, rather than a sophisticated interactive tool to edit that image's color histogram and perform a new search. While the latter approach might be useful for experienced technical users with image processing knowledge, it does not apply to the average user and therefore has limited usefulness. An ideal CBIVR system query stage should, in our opinion, hide the technical complexity of the query process from the end user. We agree with Gupta, Santini, and Jain, when they state that "A search through visual media should be as imprecise as 'I know it when I see it' [74]."

4. Summary

Visual Information Retrieval (VIR) is a new and dynamic field of research, with contributions from many areas of expertise, including (text-based) information retrieval, image processing and computer vision, human visual perception, pattern recognition, and multimedia database organization, among others.

VIR systems can be classified into two main generations: first-generation VIR systems use query by text and rely strongly on metadata; second-generation (CB)VIR systems support query by content. Several VIR systems combine the use of content-based techniques with text-based tools. Current research has focused on improving second-generation systems in order to bridge the gap between the semantic meaning of an image or video scene and the raw information that is available from that image or video.

Most VIR systems allow searching the visual database contents in several different ways - either alone or combined - such as interactive browsing, navigation with customized categories, query by example, query by visual sketch, specification of visual features, query by keyword, and specification of a semantic class.
