Multimedia Systems (1993) 1: 10-28
© Springer-Verlag 1993

Automatic partitioning of full-motion video

HongJiang Zhang, Atreyi Kankanhalli, Stephen W. Smoliar

Institute of Systems Science, National University of Singapore, Heng Mui Keng Terrace, Kent Ridge, Singapore 0511, Republic of Singapore
e-mail: [email protected]

Correspondence to: H.J. Zhang

Received January 10, 1993 / Accepted April 10, 1993

Abstract. Partitioning a video source into meaningful segments is an important step for video indexing. We present a comprehensive study of a partitioning system that detects segment boundaries. The system is based on a set of difference metrics, and it measures the content changes between video frames. A twin-comparison approach has been developed to solve the problem of detecting transitions implemented by special effects. To eliminate the false interpretation of camera movements as transitions, a motion analysis algorithm is applied to determine whether an actual transition has occurred. A technique for determining the threshold for a difference metric and a multi-pass approach to improve the computation speed and accuracy have also been developed.

Key words: Image difference analysis - Video partitioning - Video indexing - Multimedia

1 Introduction

Advances in multimedia technologies have now demonstrated that video in computers is a very important and common medium for applications as varied as education, broadcasting, publishing, and military intelligence (Mackay and Davenport 1989). The value of video is partially due to the fact that significant information about many major aspects of the world can only be successfully managed when presented in a time-varying manner. However, the effective use of video sources is seriously limited by a lack of viable systems that enable easy and effective organization and retrieval of information from those sources. Also, the time-dependent nature of video makes it a very difficult medium to represent and manage. Thus, much of the vast quantity of video that has been acquired sits on a shelf for future use without indexing (Nagasaka and Tanaka 1991). This is due to the fact that indexing requires an operator to view the entire video package and to assign index terms manually to each of its scenes. Between the abundance of unindexed video and the lack of sufficient manpower and time, such an approach is simply not feasible. Without an index, information retrieval


from video requires one to view the source during a sequential scan, but this process is slow and unreliable, particularly when compared with analogous retrieval techniques based on text. Unfortunately, when it comes to the problem of developing new techniques for indexing and searching of video sources, the intuitions we have acquired through the study of information retrieval do not translate very well into non-text media.

Clearly, research is required to improve this situation, but relatively little has been achieved towards the development of video editing and browsing tools (Mackay and Davenport 1989; Tonomura 1991). A growing interest in digital video has led to new approaches to image processing based on the creation and analysis of compressed data (Netravali and Haskell 1988; Le Gall 1991), but these activities are in their earliest stages. The Video Classification Project at the Institute of Systems Science (ISS) of the National University of Singapore (NUS) is an effort to change this situation. It aims to develop an intelligent system that can automatically classify the content of a given video package. The tangible result of this classification will consist of an index structure and a table of contents, both of which can be stored as part of the package. In this way a video package will become more like a book to anyone interested in accessing specific information.

When text is indexed, words and phrases are used as index entries for sentences, paragraphs, pages or documents. This process only selects certain key words or phrases as index terms. Similarly, a video index will require, apart from the entries used in text, key frames or frame sequences as entries for scenes or stories. Therefore, automated indexing will require the support of tools that can detect and isolate such meaningful segments in any video source. Content analysis can then be performed on individual segments in order to identify appropriate index terms.

In our study, a segment is defined as a single, uninterrupted camera shot. This reduces the partitioning task to detecting the boundaries between consecutive camera shots. The simplest transition is a camera break. Figure 1 illustrates a sequence of four consecutive video frames with a camera break occurring between the second and third frames.



Fig. 1a-d. Four frames across a camera break from a documentary video. The first two frames are in the first camera shot, and the third and fourth frames belong to the second camera shot. There are significant content changes between the second and third frames


Fig. 2a-e. Frames in a dissolve. The first frame is the one just before the dissolve starts, and the last one is the frame immediately after the end of the dissolve. The rest are the frames within the dissolve


The significant qualitative difference in content is readily apparent. If that difference can be expressed by a suitable metric, then a segment boundary can be declared whenever that metric exceeds a given threshold. Hence, establishing such metrics and techniques for applying them is the first step in our efforts to develop tools for the automatic partitioning of video packages.

The segmentation problem is also important for applications other than indexing. It is a key process in video editing, where it is generally called scene change detection. It also figures in the motion compensation of video for compression, where motion vectors must be computed within segments, rather than across segment boundaries (Liou 1991). However, accuracy is not as crucial for either of these


applications; for example, in compression, false positives only increase the number of reference frames. By contrast, in video segmentation for indexing, such false positives would have to be corrected by manual intervention. Therefore, high accuracy is a more important requirement in automating the process.

A camera break is the simplest transition between two shots. More sophisticated techniques include dissolve, wipe, fade-in, and fade-out (Bordwell and Thompson 1993). Such special effects involve much more gradual changes between consecutive frames than does a camera break. Figure 2 shows five frames of a dissolve from a documentary video: the frame just before the dissolve begins, three frames within the dissolve, and the frame immediately after the dissolve. This sequence illustrates the gradual change that downgrades the power of a simple difference metric and a single threshold for camera break detection. Indeed, most changes are even more gradual, since dissolves usually last more than ten frames. Furthermore, the changes introduced by camera movement, such as pan and zoom, may be of the same order as those introduced by such gradual transitions. This further complicates the detection of the boundaries of camera shots, since the artefacts of camera movements must be distinguished from those of gradual shot transitions.

While there has been related work on camera break detection by other researchers (Tonomura 1991; Nagasaka and Tanaka 1991; Ruan et al. 1992), few experimental data have been reported. Also, the implementation of the difference metrics for practical video processing and the automatic determination of the cutoff threshold have not yet been addressed in the current literature. This paper presents a comprehensive experimental study of three difference metrics for video partitioning: pair-wise comparison of pixels, comparison of pixel histograms, and "likelihood ratio" comparison, where the likelihood ratio metric has not been previously discussed. It also discusses the automatic selection of the threshold for a chosen difference metric, which is a key issue in obtaining high segmentation accuracy, and a multi-pass technique that upgrades performance efficiency.

So far, there has been no report on algorithms that successfully detect gradual transitions. As a major contribution to the video segmentation problem, we suggest that this problem can be solved by a novel technique called twin-comparison. This technique first uses a difference metric with a reduced threshold to detect the potential frames where a gradual transition can occur; then the difference metric is used to compare the first potential transition frame with each following frame until the accumulated difference exceeds a second threshold. This interval is then interpreted to delineate the start and end frames of the transition. Motion detection and analysis techniques are then applied to distinguish camera movements from such gradual transitions. Experiments show that the approach is very effective and that it achieves a very high level of accuracy.

Before discussing the detection of gradual transitions, we first present a set of difference metrics and their applications in detecting camera breaks in Sect. 2. In Sect. 3 we discuss how the use of a difference metric can be adapted to accommodate gradual transitions and present the twin-comparison approach. A motion analysis technique for eliminating false detection of transitions resulting from the artefacts of camera movements is also presented in Sect. 3. This is followed by a discussion of automatic selection of threshold values and the multi-pass approach to improving computational efficiency in Sect. 4. Algorithmic implementation, evaluation of algorithm performance, and the application of these techniques to some "real-world" video examples are given in Sect. 5. Finally, Sect. 6 presents our conclusions and a discussion of anticipated future work.

2 Difference metrics for video partitioning

As observed in Sect. 1, the detection of transitions involves the quantification of the difference between two image frames in a video sequence. To achieve this, we need first to define a suitable metric, so that a segment boundary can be declared whenever that metric exceeds a given threshold. Difference measures used to partition video can be divided into two major types: the pair-wise comparison of pixels or blocks, and the comparison of the histograms of pixel values. The blocks are compared with the likelihood ratio, a statistic calculated over the area occupied by a given super-pixel (Kasturi and Jain 1991). These metrics can be implemented with a variety of different modifications to accommodate the idiosyncrasies of different video sources (Nagasaka and Tanaka 1991; Ruan et al. 1992) and have been used successfully in camera break detection. Details will now be discussed.

2.1 Pair-wise comparison

A simple way to detect a qualitative change between a pair of images is to compare the corresponding pixels in the two frames to determine how many pixels have changed. This approach is known as pair-wise comparison. In the simplest case of monochromatic images, a pixel is judged as changed if the difference between its intensity values in the two frames exceeds a given threshold t. This metric can be represented as a binary function DP_i(k,l) over the domain of two-dimensional coordinates of pixels, (k,l), where the subscript i denotes the index of the frame being compared with its successor. If P_i(k,l) denotes the intensity value of the pixel at coordinates (k,l) in frame i, then DP_i(k,l) may be defined as follows:

DP_i(k,l) = 1 if |P_i(k,l) - P_{i+1}(k,l)| > t, and 0 otherwise    (1)

The pair-wise segmentation algorithm simply counts the number of pixels changed from one frame to the next according to this metric. A segment boundary is declared if more than a given percentage of the total number of pixels



(given as a threshold T) have changed. Since the total number of pixels in a frame of dimensions M by N is M * N, this condition may be represented by the following inequality:


(100 / (M * N)) \sum_{k=1}^{M} \sum_{l=1}^{N} DP_i(k,l) > T    (2)

A potential problem with this metric is its sensitivity to camera movement. For instance, in the case of camera panning, a large number of objects will move in the same direction across successive frames; this means that a large number of pixels will be judged as changed even if the pan entails a shift of only a few pixels. This effect may be reduced by the use of a smoothing filter: before comparison, each pixel in a frame is replaced with the mean value of its nearest neighbours. (We are currently using a 3 x 3 window centred on the pixel being "smoothed" for this purpose.) This also filters out some noise in the input images.
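The pair-wise test of Eqs. 1 and 2, together with the 3 x 3 smoothing just described, is straightforward to express in code. The following is a minimal numpy sketch, not the paper's implementation; the threshold values t and T are illustrative placeholders.

    import numpy as np

    def smooth3x3(frame):
        """Replace each pixel with the mean of its 3 x 3 neighbourhood
        (edges handled by repeating the border pixels)."""
        padded = np.pad(frame.astype(np.float64), 1, mode="edge")
        acc = np.zeros(frame.shape, dtype=np.float64)
        for dy in (0, 1, 2):
            for dx in (0, 1, 2):
                acc += padded[dy:dy + frame.shape[0], dx:dx + frame.shape[1]]
        return acc / 9.0

    def pairwise_break(frame_a, frame_b, t=20, T=45.0):
        """Count pixels whose change exceeds t (Eq. 1) and declare a
        boundary when the changed percentage exceeds T (Eq. 2)."""
        a, b = smooth3x3(frame_a), smooth3x3(frame_b)
        dp = np.abs(a - b) > t                    # DP_i(k, l)
        pct = 100.0 * dp.sum() / dp.size          # left-hand side of Eq. 2
        return pct > T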

2.2 Likelihood ratio

To make the detection of camera breaks more robust, instead of comparing individual pixels we can compare corresponding regions (blocks) in two successive frames on the basis of second-order statistical characteristics of their intensity values. One such metric for comparing corresponding regions is called the likelihood ratio (Kasturi and Jain 1991). Let m_i and m_{i+1} denote the mean intensity values for a given region in two consecutive frames, and let S_i and S_{i+1} denote the corresponding variances. The following formula computes the likelihood ratio and determines whether or not it exceeds a given threshold t:

[ (S_i + S_{i+1})/2 + ((m_i - m_{i+1})/2)^2 ]^2 / (S_i \cdot S_{i+1}) > t    (3)

Camera breaks can now be detected by first partitioning the frame into a set of sample areas. Then a camera break can be declared whenever the total number of sample areas whose likelihood ratio exceeds the threshold is sufficiently large (where "sufficiently large" will depend on how the frame is partitioned). An advantage that sample areas have over individual pixels is that the likelihood ratio raises the level of tolerance to slow and small object motion from frame to frame. This increased tolerance makes it less likely that effects such as slow motion will mistakenly be interpreted as camera breaks.

The likelihood ratio also has a broader dynamic range than does the percentage used in pair-wise comparison. This broader range makes it easier to choose a suitable threshold value t for distinguishing changed from unchanged sample areas. A potential problem with the likelihood ratio is that if two sample areas to be compared have the same mean and variance, but completely different probability density functions, no change will be detected. Fortunately, such a case rarely occurs in practice.
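The block-based test can be sketched as follows; this is an illustrative reading of Eq. 3, with the block size, the threshold t, and the fraction of changed blocks required for a break all chosen arbitrarily rather than taken from the paper.

    import numpy as np

    def likelihood_ratio(block_a, block_b, eps=1e-9):
        """Likelihood ratio of Eq. 3 for one pair of corresponding blocks."""
        m_i, m_j = block_a.mean(), block_b.mean()
        s_i, s_j = block_a.var(), block_b.var()
        num = ((s_i + s_j) / 2.0 + ((m_i - m_j) / 2.0) ** 2) ** 2
        return num / (s_i * s_j + eps)            # eps guards flat blocks

    def likelihood_break(frame_a, frame_b, block=16, t=3.0, frac=0.25):
        """Declare a camera break when at least `frac` of the sample
        areas exceed the likelihood-ratio threshold t."""
        h, w = frame_a.shape
        changed = total = 0
        for y in range(0, h - block + 1, block):
            for x in range(0, w - block + 1, block):
                a = frame_a[y:y + block, x:x + block].astype(np.float64)
                b = frame_b[y:y + block, x:x + block].astype(np.float64)
                total += 1
                changed += likelihood_ratio(a, b) > t
        return total > 0 and changed / total >= frac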

2.3 Histogram comparison

An alternative to comparing corresponding pixels or regions in successive frames is to compare some feature of the entire image. One such feature that can be used in segmentation algorithms is a histogram of intensity levels. The principle behind this algorithm is that two frames having an unchanging background and unchanging objects will show little difference in their respective histograms. The histogram comparison algorithm should be less sensitive to object motion than the pair-wise pixel comparison algorithm, since it ignores the spatial changes in a frame. One could argue that there may be cases in which two images have similar histograms but completely different content. However, the probability of such an event is sufficiently low that, in practice, we can tolerate such errors.

Let H_i(j) denote the histogram value for the ith frame, where j is one of the G possible grey levels. (The number of histogram bins can be chosen on the basis of the available grey-level resolution and the desired computation time.) Then the difference between the ith frame and its successor will be given by the following formula:

SD_i = \sum_{j=1}^{G} |H_i(j) - H_{i+1}(j)|    (4)

If the overall difference SDi is larger than a given thresh-old T, a segment boundary is declared. To select a suitablethreshold, SDi can be normalized by dividing it by the prod-uct of G and M * N, the number of pixels in the frame.
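A minimal sketch of the histogram test of Eq. 4, assuming 8-bit grey-level frames; the normalisation by the bin count and the frame size follows the suggestion above.

    import numpy as np

    def histogram_difference(frame_a, frame_b, bins=64):
        """SD_i of Eq. 4 on grey-level frames, normalised by G * M * N."""
        g = 256 // bins                                # grey levels per bin
        h_a = np.bincount((frame_a // g).ravel(), minlength=bins)
        h_b = np.bincount((frame_b // g).ravel(), minlength=bins)
        return np.abs(h_a - h_b).sum() / (bins * frame_a.size)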

Figure 3 shows grey-level histograms of the first three images shown in Fig. 1. Note the difference between the histograms across the camera break between the second and the third frames, while the histograms of the first and second frames are almost identical. Figure 4 illustrates the application of histogram comparison to a documentary video. The graph displays the sequence of SD_i values defined by Eq. 4 between every two consecutive frames over an excerpt from this source. Equation 4 was applied to grey-level intensities computed from the intensities of the three colour channels by the National Television System Committee (NTSC) conversion formula (Sect. 5.1). The graph exhibits two high pulses that correspond to two camera breaks. If an appropriate threshold is set, the breaks can be detected easily.

Fig. 3. Histograms of grey-level pixel values corresponding to the first three frames shown in Fig. 1

Equation 4 can also be applied to histograms of individual colour channels. A simple but effective approach is to use colour histogram comparison (Nagasaka and Tanaka 1991; Zhang et al. 1993): instead of grey levels, j in Eq. 4 denotes a code value derived from the three colour intensities of a pixel. Of course, if 24 bits of colour data were translated into a 24-bit code word, that would create histograms with 2^24 bins, which is clearly unwieldy. Consequently, only the two or three most significant bits of each colour component tend to be used to compose a colour code. A 6-bit code, providing 64 bins, has been shown to give sufficient accuracy. This approach is also more efficient for colour source material because it eliminates the need to first convert the three colour components of each pixel to a grey-level intensity.

Nagasaka and Tanaka (1991) have also used the following χ²-test equation,

SD_i = \sum_{j=1}^{G} (H_i(j) - H_{i+1}(j))^2 / H_{i+1}(j)    (5)

to make the histogram comparison reflect the difference between two frames more strongly. However, our experiments show that while this equation enhances the difference between two frames across a camera break, it also increases the difference between frames representing small changes due to camera or object movements. Therefore, the overall performance is not necessarily better than that achieved by using Eq. 4, while Eq. 5 also requires more computation time.




2.4 Limitations

In addition to the weaknesses of each of the individual metrics that have already been cited, all three types of difference metric face a severe problem if there are moving objects (either of large size or of high speed) or a sharp illumination change between two frames in a common shot, resulting in a false detection of a camera break. Flashing lights and flickering objects (such as video monitors) are common sources of errors, as will be seen in Sect. 5.

When the illumination does not change over the entire frame, the problem can be overcome by a robust approach developed by Nagasaka and Tanaka (1991) to accommodate momentary noise. It is based on the assumption that such noise usually influences no more than half of an entire frame. Therefore, a frame can be divided into a 4 x 4 grid of 16 rectangular regions; and, instead of comparing entire frames, corresponding regions are compared. This yields 16 difference values, and the camera break detection is only based on the sum of the eight lowest difference values. Such a selection of data is meant to eliminate the errors introduced by momentary noise. We plan to test this technique, but we anticipate that its performance might be made more robust by taking the sum of the median eight difference values (eliminating the four extremes from both ends).


3 Gradual transition detection

We now present the twin-comparison approach that adapts a difference metric to accommodate gradual transitions, which is our main contribution to automatic video partitioning. This discussion will be based on the histogram-comparison difference metric. We also discuss how a technique based on motion detection and analysis can be used to identify intervals of camera movement that may be confused with gradual transitions between camera shots.


3.1 The twin-comparison approach for detecting special effects

As one can observe from Fig. 4, the graph of the frame-to-frame histogram differences for a sequence exhibits two high pulses that correspond to two camera breaks. It is easy to select a suitable cutoff threshold value (such as 50) for detecting these camera breaks. However, the inset of this graph displays another sequence of pulses the values of which are higher than those of their neighbours but are significantly lower than the cutoff threshold. This inset displays the difference values for the dissolve sequence shown in Fig. 2 and illustrates why a simple application of this difference metric is inadequate.

The simplest approach to this problem would be to lower the threshold. Unfortunately, lower thresholds cannot be effectively employed, because the difference values that occur during the gradual transition implemented by a special effect may be smaller than those that occur between the frames within a camera shot. For example, object motion, camera panning, and zooming also entail changes in the computed difference value. If the cutoff threshold is too low, such changes may easily be registered as "false positives". The problem is that a single threshold value is being made to account for all segment boundaries, regardless of context. This appears to be asking too much of a single number, so a new approach has to be developed.



Fig. 4. A sequence of frame-to-frame histogram differences obtained from a documentary video, where differences corresponding to both camera breaks and transitions implemented by special effects can be observed


Fig. 5a,b. Illustration of twin-comparison. SD_p,q, the difference between consecutive frames defined by the difference metric; SD'_p,q, the accumulated difference between the current frame and the potential starting frame of a transition; Ts, the threshold used to detect the starting frame (Fs) of a transition; Tb, the threshold used to detect the ending frame (Fe) of a transition. Tb is also used to detect camera breaks, and FB is such a camera break. SD'_p,q is only calculated when SD_p,q > Ts

In Fig. 2 it is obvious that the first and the last frame are different, even if all consecutive frames are very similar in content. In other words, the difference metric applied in Sect. 2.3 with the threshold derived from Fig. 4 would still be effective were it to be applied to the first and the last frame directly. Thus, the problem becomes one of detecting these first and last frames. If they can be determined, then each of them may be interpreted as a segment boundary, and the period of gradual transition can be isolated as a segment unto itself. The inset of Fig. 4 illustrates that the difference values between most of the frames during the dissolve are higher, although only slightly, than those in the preceding and following segments. What is required is a threshold value that will detect a dissolve sequence and distinguish it from an ordinary camera shot. A similar approach can be applied to transitions implemented by other types of special effects. We shall present this twin-comparison approach in the context of an example of dissolve detection using Eq. 4 as the difference metric.

Twin-comparison requires the use of two cutoff thresholds: Tb is used for camera break detection in the same manner as was described in Sect. 2. In addition, a second, lower, threshold Ts is used for special effect detection. The detection process begins by comparing consecutive frames using a difference metric such as Eq. 4. Whenever the difference value exceeds threshold Tb, a camera break is declared, e.g., FB in Fig. 5a. However, the twin-comparison also detects differences that are smaller than Tb but larger than Ts. Any frame that exhibits such a difference value is marked as the potential start (Fs) of a gradual transition. Such a frame is labelled in Fig. 5a. This frame is then compared to subsequent frames, as shown in Fig. 5b. This is called an accumulated comparison since, during a gradual transition, this difference value will normally increase. The end frame (Fe) of the transition is detected when the difference between consecutive frames decreases to less than Ts, while the accumulated comparison has increased to a value larger than Tb.

Note that the accumulated comparison is only computed when the difference between consecutive frames exceeds Ts. If the consecutive difference value drops below Ts before the accumulated comparison value exceeds Tb, then the potential start point is dropped and the search continues for other gradual transitions. The key idea of twin-comparison is that two distinct threshold conditions must be satisfied at the same time. Furthermore, the algorithm is designed in such a way that gradual transitions are detected in addition to ordinary camera breaks.

A problem with twin-comparison is that there are some gradual transitions during which the consecutive difference value does fall below Ts. This problem is solved by permitting the user to set a tolerance value that allows a number (such as two or three) of consecutive frames with low difference values before rejecting the transition candidate. This approach has proven to be effective when tested on real video examples.
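The twin-comparison procedure can be sketched as follows, operating on a precomputed list of consecutive-frame differences. The control flow, the tolerance counter, and the return format are one possible reading of the rules above, not the authors' code.

    def twin_comparison(diffs, t_b, t_s, tolerance=2):
        """diffs[i] is the difference between frames i and i+1.
        Returns (start, end, kind) tuples for camera breaks and
        gradual transitions."""
        boundaries = []
        i, n = 0, len(diffs)
        while i < n:
            if diffs[i] > t_b:                       # camera break (F_B)
                boundaries.append((i, i + 1, "break"))
            elif diffs[i] > t_s:                     # potential start (F_s)
                acc, low_run = 0.0, 0
                j = i
                while j < n:
                    acc += diffs[j]                  # accumulated comparison
                    low_run = low_run + 1 if diffs[j] < t_s else 0
                    if diffs[j] < t_s and acc > t_b: # end frame (F_e)
                        boundaries.append((i, j + 1, "gradual"))
                        i = j
                        break
                    if low_run > tolerance:          # candidate rejected
                        break
                    j += 1
            i += 1
        return boundaries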

3.2 Distinguishing special effects from camera movements

Given the ability to detect gradual transitions such as those that implement special effects, one must distinguish changes associated with those transitions from changes that are introduced


by camera panning or zooming. Changes due to camera movements tend to induce successive difference values of the same order as those of gradual transitions, so the problem cannot be resolved by introducing yet another cutoff threshold value. Instead, it is necessary to detect patterns of image motion that are induced by camera movement.

The specific feature that serves to detect camera movements is optical flow, a technique rooted in computer vision. The optical flow fields resulting from panning and zooming are illustrated in Fig. 6. In contrast, transitions implemented by special effects, such as dissolve, fade-in and fade-out, will not introduce such motion fields. Therefore, if such motion vector fields can be detected and analysed, changes introduced by camera movements can be distinguished from those due to special-effect transitions.

Fig. 6a-c. Motion vector patterns resulting from camera panning and zooming. a Camera panning. b Camera zoom-out. c Camera zoom-in

Optical flow computation involves representing the difference between two consecutive frames as a set of motion vectors. As is illustrated in Fig. 6a, during a camera pan these vectors will predominantly have the same direction. (Clearly, if there is also object movement in the scene, not all vectors need share this property.) Thus, the distribution of motion vectors in an entire frame resulting from a camera panning should exhibit a single strong modal value that corresponds to the movement of the camera. In other words, most of the motion vectors will be parallel to the modal vector. This leads to

\sum_{k=1}^{N} |\theta_k - \theta_m| \le \Theta_p    (6)

where θ_k is the direction of motion vector k, θ_m is the direction of the modal vector, and N is the total number of motion vectors in a frame. |θ_k - θ_m| is zero when the two vectors are exactly parallel. Equation 6 thus counts the total variation in direction of all motion vectors from the direction of the modal vector, and a camera pan is declared if this variation is smaller than Θ_p. Ideally, Θ_p should be zero, but we use a non-zero threshold to accommodate errors such as those that may be introduced by object motion.

In the case of zooming, the field of motion vectors has a minimum value at the focus centre: the focus of expansion (FOE) in the case of zoom out, or the focus of contraction (FOC) in the case of zoom in. Indeed, if the focus centre is located in the centre of the frame and there is no object movement, then the mean of all the motion vectors will be the zero vector. Unfortunately, determining an entire frame of motion vectors with high spatial resolution is a very time-consuming process, so locating a focus centre is no easy matter.

Since we are only interested in determining that a zoom takes place over a sequence of frames, it is not necessary to locate the focus centre. If we assume that, in general, the focus centre of a camera zoom will lie within the boundary of a frame, we may apply a simpler vector comparison technique. In this approach we compare the vertical components of the motion vectors for the top and bottom rows of a frame, since during a zoom these vertical components will have opposite signs. Mathematically, this means that in every column the magnitude of the difference between these vertical components will always exceed the magnitude of both components. That is,

|v_k^{top} - v_k^{bottom}| \ge max(|v_k^{top}|, |v_k^{bottom}|)    (7.1)

One may again modify this "always" condition with a tolerance value, but this value can generally be low, since there tends to be little object motion at the periphery of a frame. The horizontal components of the motion vectors for the left-most and right-most columns can then be analysed the same way. That is, the vectors of two blocks located at the left-most and right-most columns but in the same row will satisfy the following condition:

|u_k^{left} - u_k^{right}| \ge max(|u_k^{left}|, |u_k^{right}|)    (7.2)

A zoom is said to occur when both conditions 7.1 and 7.2 are satisfied for the majority of the motion vectors.
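A sketch of the pan and zoom tests of Eqs. 6, 7.1 and 7.2, assuming the motion vectors have already been computed on a block grid. The use of the median direction as a cheap stand-in for the modal direction, and the tolerance parameters pan_tol and zoom_frac, are assumptions of this sketch.

    import numpy as np

    def classify_camera_motion(u, v, pan_tol=0.3, zoom_frac=0.6):
        """u, v: 2-D arrays holding the horizontal and vertical
        components of the block motion vectors between two frames."""
        theta = np.arctan2(v, u).ravel()
        theta_m = np.median(theta)                 # stand-in for the mode
        # Eq. 6: total angular deviation from the modal direction,
        # wrapped into [-pi, pi] before taking magnitudes
        dev = np.abs(np.angle(np.exp(1j * (theta - theta_m))))
        if dev.sum() <= pan_tol * theta.size:
            return "pan"
        top, bottom = v[0, :], v[-1, :]            # Eq. 7.1, column-wise
        left, right = u[:, 0], u[:, -1]            # Eq. 7.2, row-wise
        zoom_v = np.abs(top - bottom) >= np.maximum(np.abs(top), np.abs(bottom))
        zoom_h = np.abs(left - right) >= np.maximum(np.abs(left), np.abs(right))
        if zoom_v.mean() >= zoom_frac and zoom_h.mean() >= zoom_frac:
            return "zoom"
        return "other"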

The field of motion vectors that is required for this analysis can be computed by the block-matching algorithm commonly used for motion compensation in video compression (Netravali and Haskell 1988). This algorithm requires the partitioning of the frame into blocks, and motion vectors are computed for each block by finding the minimum value of a cost function over a set of trial vectors. Suppose each block is M lines high and N pixels wide; and, within a block, let b(m, n) denote the intensity of the pixel with coordinates (m, n). Then one may define a cost function for block k between two consecutive frames, i and i + 1, in terms of an error-power function F as follows:

P_i(k) = \sum_{m=1}^{M} \sum_{n=1}^{N} F[ b_i(m,n) - b_{i+1}(m + \Delta x, n + \Delta y) ]    (8)

where (Δx, Δy) is the trial vector under consideration. Given a set of trial vectors, the vector that minimizes P_i(k) can be assigned as the motion vector for block k. The calculation time to find a motion vector depends mainly on the number of trial vectors, which, in turn, is based on the possible maximum velocity of camera movement; the resolution of the field of motion vectors depends on the size of the blocks compared.
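An exhaustive block-matching sketch of Eq. 8 with F taken to be the squared error; the block size and search range are illustrative choices, not the paper's parameters.

    import numpy as np

    def block_motion_vector(frame_a, frame_b, y, x, block=8, search=4):
        """Try every displacement within +/- search pixels and keep the
        one that minimises the error power P_i(k)."""
        ref = frame_a[y:y + block, x:x + block].astype(np.float64)
        best, best_cost = (0, 0), np.inf
        h, w = frame_b.shape
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                yy, xx = y + dy, x + dx
                if yy < 0 or xx < 0 or yy + block > h or xx + block > w:
                    continue
                cand = frame_b[yy:yy + block, xx:xx + block].astype(np.float64)
                cost = ((ref - cand) ** 2).sum()   # P_i(k) for this trial
                if cost < best_cost:
                    best_cost, best = cost, (dx, dy)
        return best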

Apart from the approach just presented, one can also identify camera motion based on sophisticated optical flow analyses (Kasturi and Jain 1991). However, such a technique will require a much greater density of motion vectors (one vector per pixel, in principle). Computing such a resolution of motion vectors is very time consuming, requiring either iterative refinement of a gradient-based algorithm (Horn and Schunck 1981) or the construction of a hierarchical framework of cross-correlation (Anandan 1989). More importantly, although optical flow calculations have been under investigation by many researchers, besides the difficulties in obtaining high accuracy and robustness, the recovery of motion from optical flow and the application of smoothness constraints are still research issues (Kasturi and Jain 1991). Therefore, instead of waiting for more reliable algorithms for recovering camera motion from optical flow,



our current, more limited, motion vector analysis algorithm can provide reasonable accuracy, as one can see from the experimental results presented in Sect. 5.

In spite of the computational expense, there are still advantages to working with the block-matching algorithm. In the future we plan to make use of motion vectors retrieved from files of video that have been compressed by the motion-compensation technique. Such files can be computed already during a real-time scan of video input by video compression hardware, such as an MPEG (Moving Pictures Experts Group) chip (Ang et al. 1991).

4 Applying the comparison techniques

Before the difference metrics for transition detection can be applied to full-motion video, two practical issues have to be considered. First, how does one select the appropriate threshold values for a given video source and difference metric? Secondly, comparing each pair of consecutive frames of a video source (54000 frames for half an hour's worth of material) is a very time-consuming process; hence, a scheme to speed up the partitioning process needs to be considered. We discuss the solutions to these two issues.

4.1 Threshold selection

Selection of appropriate threshold values is a key issue in applying the segmentation algorithms described in Sects. 2 and 3. Thresholds must be assigned that tolerate variations in individual frames while still ensuring a desired level of performance. A "tight" threshold makes it difficult for "impostors" to be falsely accepted by the system, but at the risk of falsely rejecting true transitions. Conversely, a "loose" threshold enables transitions to be accepted consistently, at the risk of falsely accepting "impostors". In order to achieve high accuracy in video partitioning, an appropriate threshold must be found.

The threshold t, used in pair-wise comparison for judging whether a pixel or super-pixel has changed across successive frames, can be easily determined experimentally, and it does not change significantly for different video sources. However, this is not the case for the cutoff threshold Tb: the difference value for determining a segment boundary using any of the algorithms varies from one video source to another. For instance, "camera breaks" in a cartoon film tend to exhibit much larger frame-to-frame differences than those in a "live" film. Obviously, the threshold to be selected must be based on the distribution of the frame-to-frame differences of the video sequence.

Considerable research has been done on the selection of thresholds for the spatial segmentation of static images. A good summary can be found in Digital Picture Processing by Rosenfeld and Kak (1982). Typically, selecting thresholds for such spatial segmentation is based on the histogram of the pixel values of the image. The conventional approaches include the use of a single threshold, multiple thresholds and variable thresholds. The accuracy of single threshold selection depends upon whether the histogram is bimodal, while multiple threshold selection requires clear multiple peaks in the histogram. Variable threshold selection is based on local histograms of specific regions in an image. In spite of this variety of techniques, threshold selection is still a difficult problem in image processing and is most successful when the solution is application dependent. In order to set an appropriate threshold for temporal segmentation of video sequences, we draw upon the same feature, i.e., the histogram of the frame-to-frame differences. Thus, it is necessary to know the distribution of the frame-to-frame differences across camera breaks and gradual transitions.

The automatic selection of threshold Tb is based on the normalized frame-to-frame differences over an entire given video source. The dashed curve (M) in Fig. 7 shows a typical distribution of difference values. This particular example is based on the difference metric for comparison of colour code histograms obtained from a documentary video. The range of difference values is given on the horizontal axis, and the frequency of occurrence of each difference value is represented as a percentage of the total number of frame-to-frame differences on the vertical axis. This particular distribution exhibits a high and sharp peak on the left corresponding to a large number of consecutive frames that have a very small difference between them. The long tail to the right corresponds to the small number of consecutive frames between which a significant difference occurs. Because this histogram has only a single modal point, the approaches for threshold selection already mentioned are not applicable. To solve this problem, we propose an alternative approach to selecting the threshold value Tb as follows.

Fig. 7. M, distribution of computed frame-to-frame differences based on a colour code histogram for all frames of the documentary video; G, Gaussian distribution derived from the mean and variance of distribution M


If there is no camera shot change or camera movement in a video sequence, the frame-to-frame difference value can only be due to three sources of noise: noise from digitizing the original analog video signal, noise introduced by video production equipment, and noise resulting from the physical fact that few objects are perfectly still. All three sources of noise can be assumed to be Gaussian. Thus, the distribution of frame-to-frame differences can be decomposed into a sum of two parts: the Gaussian noise and the differences introduced by camera breaks, gradual transitions, and camera movements. Obviously, differences due to noise have nothing to do with transitions. Statistically, the second part accounts for less than 15% of the total number of frames in a documentary video that we have used as a source of experimental data.

Let σ be the standard deviation and μ the mean of the frame-to-frame differences. If the only departure from μ is due to Gaussian noise, then the probability integral

P(x) = \int_0^x (1 / (\sqrt{2\pi} \sigma)) e^{-(u - \mu)^2 / (2\sigma^2)} du    (9)

(taken from 0 since all differences are given as absolute values) will account for most of the frames within a few standard deviations of the mean value. In other words, the frame-to-frame differences from the non-transition frames will fall in the range of 0 to μ + ασ for a small constant value α. For instance, if we choose α = 3, Eq. 9 will account for 99.9% of all difference values. The black curve (G) in Fig. 7 shows the Gaussian distribution obtained from σ and μ for the frame-to-frame differences from which the M curve was calculated. Therefore, the threshold Tb can be selected as

T_b = \mu + \alpha\sigma    (10)

That is, difference values that fall out of the range from 0 to μ + ασ can be considered indicators of segment boundaries. From our experiments, the value α should be chosen between five and six when the histogram comparison metric is used. Under a Gaussian distribution, the probability that a non-transition frame will fall out of this range is practically zero.

For detecting gradual transitions, another threshold, Ts, defined in Fig. 5, also needs to be selected. Experiments have shown that Ts should be selected along the right slope of the M distribution shown in Fig. 7. Furthermore, Ts should generally be larger than the mean value of the frame-to-frame differences of the entire video package. After examining three documentary videos, we observed that Ts does not vary significantly; a typical value usually lies between eight and ten.
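Both thresholds can be estimated from the array of frame-to-frame differences. The sketch below implements Eq. 10 directly; the percentile rule used for Ts is this sketch's own heuristic for "along the right slope", since the text gives only qualitative guidance.

    import numpy as np

    def select_thresholds(diffs, alpha=5.5):
        """T_b from Eq. 10; alpha between five and six worked for the
        histogram metric according to the text."""
        mu, sigma = np.mean(diffs), np.std(diffs)
        t_b = mu + alpha * sigma
        t_s = max(float(np.percentile(diffs, 85)), mu)  # illustrative T_s
        return t_b, t_s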

Once threshold values have been established, the p~tion-ing algorithms can be applied in "delayed real-ti.Jne". Onlyone pass through the entire video package is required to de-tennine all the camera breaks. However, the delay can bequite substantial given 30 fps of colour source material. In



addition, such a single-pass approach has the disadvantage that it does not exploit any information other than the threshold values. Therefore, this approach depends heavily on the selection of those values.

A straightforward approach to the reduction of processing time is to lower the resolution of the comparison. This can be done in two ways: either spatially or temporally. In the spatial domain, one may sacrifice resolution by examining only a subset of the total number of pixels in each frame. However, this is clearly risky, since if the subset is too small, the loss of spatial detail may result in a failure to detect certain segment boundaries. Also, resampling the original image may even increase the processing time, and that would run contrary to our goal of applying spatial resampling to reduce the processing time.

Alternatively, to sacrifice temporal resolution by examining fewer frames is a better choice, since in motion video temporal information redundancy is much higher than spatial information redundancy. For example, one could apply a "skip factor" of ten (examining only three frames per second of video time from a 30-frames/s source). This will reduce the number of comparisons (and, therefore, the associated processing time) by the same factor. An advantage of this approach is that it will also detect some of the transitions implemented by special effects as camera breaks, since the difference between two frames that are ten frames apart across such a transition could be larger than the threshold Tb. A drawback of this approach is that the accuracy of locating the camera break decreases with the same factor as the skip. Also, if the skip factor is too large, the change during a camera movement may be so great that it leads to a false detection of a camera break. (This was observed experimentally in a system that was limited to "grabbing" only one frame/s.)

To overcome the problems of low resolution in locating segment boundaries, we have developed a multi-pass approach that improves processing speed and achieves the same order of accuracy. In the first pass resolution is sacrificed temporally to detect potential segment boundaries. In this process twin-comparison for gradual transitions is not applied. Instead, a lower value of Tb is used, and all frames across which there is a difference larger than Tb are detected as potential segment boundaries. Due to the lower threshold and large skip factor, both camera breaks and gradual transitions, as well as some artefacts due to camera movement, will be detected; but a number of false detections will also be admitted, as long as no real boundaries are missed. In the second pass all computation is restricted to the vicinity of these potential boundaries. Increased resolution is used to locate all boundaries (both camera breaks and gradual transitions) more accurately. Also, motion analysis is applied to distinguish camera movements from gradual transitions.

With the multi-pass approach, different detection algorithms can be applied in different passes to increase confidence in the results. For instance, for a given video package, we can apply either pair-wise or histogram comparison in the first, low-resolution, pass with a large skip factor and a lower threshold; in the second pass, different detection algorithms are applied independently to the potential boundaries detected by the first pass. The results from the two algorithms can then be used to verify each other, and positive results from both algorithms will have sufficiently high confidence to be declared as segment boundaries. As will be seen from the following experimental results, the multi-pass approach can achieve both high speed and accuracy.

Adaptive sampling in time can also be applied in the segmentation process to reduce processing time. That is, we only examine those frames with a high likelihood of crossing a segment boundary and skip the frames with a low likelihood. For instance, we can skip a number of frames after a camera shot boundary is detected, since there are not likely to be two boundaries in very close succession. However, in general, it is difficult to obtain a priori knowledge about such likelihood. More importantly, we currently calculate the mean and variance for those frame-to-frame differences taken over an entire video sequence and sampled according to a constant skip factor, and those mean and variance values are used to determine the threshold for segment boundaries. That is, we need to calculate the threshold from the video sequence before the sequence is partitioned. An adaptive sampling technique that introduces variable skip factors would destroy the statistical properties of those calculations, possibly resulting in an inadequate threshold. In addition, variable thresholds would be needed if adaptive sampling were to be applied, which would further complicate the processing. However, the detection process in the second pass (when the two-pass approach is used with the same metric in both passes) is, in fact, an adaptive sampling process in time, as only the frames close to the potential boundaries are examined in the second pass, while other frames between the potential boundaries are skipped. In this case, however, the threshold value is obtained from the first pass.
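A skeleton of the two-pass scheme, assuming the same metric in both passes. get_diff, the halved first-pass threshold, and the re-examination window are assumptions of this sketch, not values from the paper.

    def two_pass_partition(get_diff, n_frames, t_b, skip=10):
        """First pass: compare frames `skip` apart against a lowered
        threshold to collect candidate regions. Second pass: re-examine
        every consecutive pair inside each candidate region at full
        temporal resolution. get_diff(i, j) returns the difference
        between frames i and j."""
        candidates = [i for i in range(0, n_frames - skip, skip)
                      if get_diff(i, i + skip) > 0.5 * t_b]
        boundaries = set()
        for c in candidates:
            for i in range(c, min(c + skip, n_frames - 1)):
                if get_diff(i, i + 1) > t_b:
                    boundaries.add(i)
        return sorted(boundaries)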

5 Implementation and evaluation

The partitioning algorithms described have been implemented and applied on a variety of video materials. The output of the segmentation process for a given video package is a sequence of segment boundaries. These boundaries are compared with those detected manually from the same video sources as a basis for evaluation. We shall now discuss the implementation of these algorithms, the experimental results, and the comparative value of each approach.

5.1 System and algorithm implementation

The implementation platform consists of a Macintosh IIfx connected to a Pioneer LD-V8000 video laser disc player, as shown in Fig. 8. The interface between the laser disc player and the computer is provided by a RasterOps ColorBoard 364 video board, which combines the functions of a true colour frame buffer with hardware pan and zoom, "frame grabbing", and real-time video display at a rate of 30 frames/s. A software control panel for the video disc player has been developed, with a user-friendly interface to control


frame searching, input of algorithm parameters, monitoring of the segmentation process, and display and storage of segmentation results.

During segmentation, the frames to be compared are directly grabbed from the laser disc rather than stored as a file of compressed digital video. The frames are of resolution 278 x 208, with each pixel represented by 32 bits of data: 8 bits for each of the three colour components and the first 8 bits unused. The colour intensities can then be transformed into a single 8-bit value of grey level or a 6-bit colour code, composed from the most significant 2 bits of each of the three colour components, as illustrated in Fig. 9. If grey-level intensities are used in the segmentation, they are computed from the three colour components according to the NTSC standard:

I = 0.299R + 0.587G + 0.114B

The algorithms described in this paper have been tested ona number of video sequences, including documentaries, ed-ucational material, and cartoons. However, we shall onlypresent the results of the partitioning of a 7-min animatedcartoon and two documentaries: The first is a 20-min 10-cal production about the Faculty of Engineering at NUS;the second is a 40-min commercial travelogue about Singa-pore. These video packages provide representative materialfor testing the 'algorithms. The cartoon employs few specialeffects, so while all "camera shots" are actually simulationsin animation, they are all separated relatively conventionally.Therefore, this was excellent material for the initial testingand evaluation of our algorithms for camera break detection.The NUS documentary video provides a combination of livefootage and a few animated sequences, along with the useof various special effects. The entre documentary consistsof approximately 200 camera shots, and the transitions areimplemented by both dissolves and camera breaks. Thereare also several sequences of camera panning and zooming,as well as object motion. This video thus provides suitablematerial for test and evaluation of algorithms for both cam-era break and gradual transition detection, as well as for thetechnique for camera movement detection. In the travel doc-umentary there are many more camera breaks and cameramovements, as well as more sequences where moving ob-jects and lighting changes arising from flashing lights andthe flickering of some objects occur. Thus, this video pro-vides more data for testing the robustness and limitations ofthe partitioning algorithms.

Three experiments will now be discussed. The first is forproof-of-concept, testing the effectiveness of the differencemetrics for segment boundary detection on the cartoon video.The second one presents both a much more in-depth analysisof the difference metrics and a test of twin-comparison fordetecting gradual transitions. The comparison of histogramsbased on colour codes is further analysed by subjecting thecomplete travel documentary video to both twin-comparisonand subsequent analysis for camera movement detection. IPresentation of these results will include a detailed erroranalysis. The final experiment concentrates on the effective-ness of the camera movement detection techniques presentedin Sect.3.2.




5.2.1 Example 1: Camera break detection on a cartoon video

Table 1 lists the results of applying each of four difference metrics in a single pass to partition the cartoon video (12990 frames). Since we were only concerned with testing the difference metrics, we were only interested in detecting camera breaks.



Table 1. Segmentation results of single-pass algorithms

Algorithm                Tb    Nd    Nm    Nf    τ (seconds)
Pair-wise pixel, grey    -     51     2     6     71320
Likelihood, grey         -     49     4     -    103023
Histogram, grey          -     44     9     -     32475
Histogram, colour        -     48     5     -     20861

Tb, the threshold selected; Nd, the total number of camera breaks correctly detected by each algorithm; Nm, the number of boundaries missed; Nf, the number of boundaries misdetected; τ, the processing time

Table 2. Segmentation results of multi-pass algorithms

Algorithm                Tb    Nd    Nm    Nf    τ (seconds)
Pair-wise pixel, grey    -     49     4     6     17386
Likelihood               -     49     4    25     19546
Histogram                -     43    10    11     15752
Hybrid                   -     42    11     7     22183

Abbreviations as in Table 1

The performance was evaluated with respect to those segment boundaries detected by a manual analysis. The processing time (τ) for each algorithm is also shown in the table for comparison of the processing load among the four difference metrics. Table 1 indicates that all algorithms detect camera breaks with satisfactory accuracy. The leading success of pair-wise comparison is partially due to the fact that its threshold was optimally tuned, which was not the case for the other algorithms. Nevertheless, it missed two transitions implemented as dissolves and falsely detected six artefacts of camera and object movements. Due to the high computation time required by Eq. 3, the likelihood-ratio algorithm is slow. Of the four algorithms, the colour histogram gives the most promising overall result: both high speed and high accuracy. The major sources of error in this case are camera panning, object motion, and improper threshold selection.

We have also tested the multi-pass concept on the cartoon video, and the results are shown in Table 2. The first three results were obtained by using the same difference metric in both passes, while with the hybrid method both the histogram comparison and the likelihood criterion were applied, and the results from each of the algorithms were used to verify each other. As presented in Sect. 4.2, a lower temporal resolution (skip factor of 3) was used in the first pass. In addition, when using pair-wise comparison, a lower spatial resolution (averaging over 3 x 4 pixel regions) was also used in the first pass. The results in Table 2 indicate that this approach performs much faster than a single pass, with comparable or better accuracy. In the hybrid approach, a segment boundary is declared only if both algorithms detect it as a boundary. As indicated in the table, such a hybrid approach allows high-confidence camera breaks to be detected and reduces the number of false detections.
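A minimal sketch of this multi-pass scheme, under our assumptions about the interfaces (a difference metric passed in as a function, and a lowered threshold t_low for the first pass):

    def multi_pass(frames, metric, t_low, t_full, skip=3):
        # Pass 1: coarse temporal resolution with a lowered threshold,
        # yielding candidate regions rather than final boundaries.
        candidates = [(i, i + skip)
                      for i in range(0, len(frames) - skip, skip)
                      if metric(frames[i], frames[i + skip]) > t_low]
        # Pass 2: full temporal resolution restricted to the candidates.
        boundaries = []
        for lo, hi in candidates:
            boundaries.extend(j for j in range(lo, hi)
                              if metric(frames[j], frames[j + 1]) > t_full)
        return boundaries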

5.2.2 Example 2: Transition detection on a documentary video

A more general test of segmentation, based on three different types of histogram comparison, was applied to the NUS documentary video. The twin-comparison approach was applied to detect both camera breaks and gradual transitions implemented by special effects. The reason that only histogram comparison algorithms were used in this experiment is that they are insensitive to object motion and faster than the two types of pair-wise comparison algorithms. The segmentation results are summarized in Table 3.

As in the case of Tables 1 and 2, the camera breaks detected algorithmically, as well as those missed and misdetected, are listed. In addition, the same information is provided for gradual transitions. It should be noted that camera movements that may subsequently be detected by the technique discussed in Example 3 are also included here. Similar results, obtained from the differences of colour code histograms for the travel documentary, are given in Table 4 as further evaluation of the performance and robustness of this particular algorithm, which will later be seen to be the optimal technique.

The first two rows of Table 3 show the results of applying difference metrics (4) and (5), respectively, to grey-level histograms. The third row shows the results of applying difference metric (4) to histograms of the 6-bit colour code. In order to reduce computation time, only every second frame is compared (skip factor of 2), giving a temporal resolution of 15 frames/s. The last set of results was obtained by a two-pass algorithm: in the first pass the colour-code histogram was applied with a skip factor of 10 and a reduced threshold, and the second pass then used a skip factor of 2.
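For reference, the two forms of histogram comparison can be sketched as follows; we write metric (4) as a sum of absolute bin-wise differences and metric (5) as a χ²-test comparison, which matches their descriptions here, though the exact normalizations defined in the earlier sections may differ:

    import numpy as np

    def colour_code_histogram(code_image):
        # 64-bin histogram of the 6-bit colour code.
        return np.bincount(code_image.ravel(), minlength=64)

    def sd_metric(h1, h2):
        # Metric (4): sum of absolute bin-wise differences.
        return np.abs(h1 - h2).sum()

    def chi_square_metric(h1, h2):
        # Metric (5): chi-square style comparison; empty reference bins
        # are guarded against division by zero.
        denom = np.where(h1 > 0, h1, 1)
        return (((h1 - h2) ** 2) / denom).sum()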

Table 3 shows that histogram comparison, based on either grey level or colour code, gives very high accuracy in detecting both camera breaks and gradual transitions. In fact, no effort was made to tune the thresholds to obtain these data. Approximately 90% of the breaks and transitions are correctly detected. Among the three single-pass algorithms, colour gives the most promising result: besides its high accuracy, it is also the fastest of the three. (The 6-bit colour code requires only 64 histogram bins, instead of the 256 bins required for the 8-bit grey level.) The much smaller number of missed breaks suggests that the colour histogram is better than the grey level at detecting qualitative differences in content. In other words, it would appear that 6 bits of colour code provide more effective information than 8 bits of grey level. Similar accuracy in detecting both camera breaks and transitions is exhibited in Table 4, again using the colour histogram algorithm, even though there are many more breaks and camera movements in this video.

Note that the χ²-test histogram comparison algorithm does not yield a better result, even after tuning the threshold. In fact, the numbers of missed and false detections in the second row of Table 3 are dramatically greater than those in the first row. This is contrary to the conclusions of Nagasaka and Tanaka (1991), where no experimental data were presented. Also, χ²-test comparison requires the longest computation time of the three algorithms. Therefore, we shall not discuss this algorithm any further, concentrating our attention on simple histogram comparison.

The multi-pass approach achieves similar accuracy, but it is almost three times faster than the single-pass colour code algorithm. The increase in missed transitions was due to errors in the first pass and can be reduced by either lowering the threshold or increasing the skip factor for the first pass. Accuracy may also be improved by enlarging the search vicinity around the potential boundaries in the second pass.


Table 3. Detection results for camera breaks and gradual transitions based on four types of twin-comparison algorithms applied to the NUS documentary video

                                Camera breaks        Transitions + camera movements
Algorithms used                 Nd    Nm    Nf       Nd    Nm    Nf    Nz    Np
Grey-level comparison          101     8     1       65    13     -     -     -
χ² grey-level comparison        93    16     3       60    18     -     -     -
Colour code comparison          95    14     2       73     5     -     -     -
Colour code, 2 passes           88    21     2       71     7     -     -     -

Nd, the total number of camera breaks (or transitions) correctly detected by each algorithm; Nm, the number of boundaries missed; Nf, the number of boundaries misdetected; Np, the number of "transitions" actually due to camera panning; Nz, the number of "transitions" actually due to camera zooming

The missed camera breaks mainly result from the fact that the differences between some frames across a camera break are lower than the given threshold. Both the grey-level and colour histogram comparison algorithms failed to detect the two frames across the camera break shown in Fig. 10, because the histograms of the two frames are similar. All missed camera breaks based on simple histogram comparison listed in Table 3 have resulted from this problem. The missed gradual transitions are mainly due to the same problem, as shown in Fig. 11. However, this problem cannot be solved simply by lowering the threshold, because this will lead to an increase in the number of false detections. We have to accept the fact that, as in spatial segmentation in static image processing (Rosenfeld and Kak 1982), it is almost impossible to obtain a threshold that will achieve 100% accuracy. For this reason our segmentation system also provides a set of editing tools to enable the user to correct those few errors that result from the automatic partitioning process.

The second class of errors, false camera break detections, is mainly due to sharp changes in lighting arising from flashing lights and flickering objects, such as a computer screen, in the image. This type of error accounts for all the false breaks listed in all the rows of Table 3, except for some instances based on the χ²-test histogram comparison, and four false breaks in Table 4. Such problems may be solved by using the robust algorithms presented in Sect. 4.2 if the large illumination changes occur only in a restricted portion of the frame. The rest of the false breaks listed in Table 4 result from dramatic content changes in short transitions of 20 frames or less; that is, a short transition may be detected as one or more breaks due to the large content changes within the transition. This fact also accounts for four missing transitions that were falsely detected as breaks.


Table 4. Results of applying the twin-comparison approach to camera break and transition detection and the camera movement detection approach on the travel documentary video. Differences were calculated from colour histograms

Camera breaks          Transitions + camera movements
Nd    Nm    Nf         Nd    Nm    Nf    Nz    Np
280    2    10         104    6    19    20    22

Nd, the total number of camera breaks correctly detected; Nm, the number of boundaries missed; Nf, the number of boundaries misdetected; Nz, the number of "transitions" actually due to camera zooming; Np, the number of "transitions" actually due to camera panning

Flashing lights and flickering objects are also the major source of false detection of transitions: 9 of the 19 false transitions listed in Table 4 resulted from such situations. In fact, gradual transitions are more sensitive than camera breaks to such changes, since even a gradual change of lighting can cause the false detection of a transition. Figure 12 shows an example of such a false transition, in which four frames from a shot of a colourful water fountain were detected as a transition due to the coloured lighting changes.


Object movement is another main source of false detection of gradual transitions, and it accounts for most of the false transitions in both Table 3 and Table 4. Figure 13 shows an example of such a false detection. Movement of the large object results in large changes between consecutive frames that may far exceed the thresholds set for transitions. However, object motion in general tends not to induce the regular patterns in the motion field that arise from camera movements. Therefore, object motion cannot be detected by the technique developed for camera movement detection. An alternative is to identify a moving object as a cluster in the motion field and then use that cluster of vectors to track the object. One can then assume that a transition will not take place in the middle of an object's trajectory and thus eliminate the false detection. Hence, detecting object motion is an important part of our efforts to improve the accuracy of video partitioning algorithms.


Fig. 10a,b. Missed camera breaks: two frames across a camera break that both the grey-level and the colour-histogram comparison algorithms failed to detect, because the histograms of the two frames are similar


Fig. 11a-d. A missed gradual transition: four frames from a dissolve transition that the twin-comparison approach, using colour-histogram comparison as the difference metric, failed to detect, because the colour histograms of the frames in the two shots are very close to each other



As a supplement to the data in Table 3, note that only nine of the errors of missing or false breaks or transitions are common to the analyses based on grey-level and colour code histograms, respectively. Such overlap would be even less likely if we compared the results of pair-wise comparison and histogram comparison. In other words, the results from two different algorithms may compensate for each other if we combine them in a proper way, rather than using the exclusive verification of the hybrid multi-pass approach presented in Table 2. However, if we use the opposite scheme, where a segment boundary is declared whenever it is detected by either of the two algorithms, the number of false detections may increase. To avoid this, we are currently investigating the use of a confidence measure based on both the difference across two frames and the selected threshold. The concept of a "fuzzy boundary" may be used in combining the segment boundaries obtained from different algorithms. While it should be relatively easy to increase the accuracy of camera break detection, improving the accuracy in detecting gradual transitions will probably be much more difficult.

Table 5. Detection of camera movements by motion detection and analysis algorithms

Camera movements       Nd    Nm    Nf
Panning (skip = 1)     14     0     3
Zooming (skip = 1)      3     4     0
Zooming (skip = 5)      7     0     0

Nd, the number of camera movements detected by the algorithms; Nm, the number of camera movements missed by the algorithms; Nf, the number of false detections of camera movements


5.2.3 Example 3: Camera movement detection

Camera pan and zoom sequences are distinguished from gradual transitions by using the motion analysis techniques discussed in Sect. 3.2. Twenty blocks of 15 x 15 pixels, uniformly distributed in a frame as four rows of five columns, are used to compute the motion vectors with the following cost function:


Fig. 12a-d. False detection of a transition: four frames from the same camera shot in which a transition was detected by twin-comparison due to flashing lights and flickering of the water fountain


P_i(k) = \sum_{m=1}^{M} \sum_{n=1}^{N} | b_i(m,n) - b_{i+1}(m,n) |     (12)


The search area for minimizing this cost function is crucial for obtaining accurate motion vectors at an acceptable computation speed. In order to accommodate the video rate and the maximum velocity of camera movement, we have chosen 9 pixels per frame as the search area. It is not necessary to compare every pair of consecutive frames in a potential transition to determine whether they resulted from a camera movement. Instead, only two pairs of consecutive frames are used to compute motion vectors, and the second pair is used to verify the first. If the motion vectors detected from the two pairs oppose each other, then they are probably not due to camera movements. If a potential transition lasts a large number of frames, it may be necessary to compute more than two pairs of frames in order to obtain higher confidence in the results.

Let (uk, vk) be a motion vector, where k ranges from 1 to N. Following the analysis in Sect. 3.2, let (um, vm) be the modal value of these N motion vectors. A camera pan is then detected if the fraction of motion vectors equal to the modal value exceeds a given threshold Tp. Camera zooms are detected using the comparison approach described in Sect. 3.2, which examines only the border rows and columns.
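The motion analysis just described can be sketched as follows; the block-matching search and the modal-vector test follow the text above, while the threshold value and the exact form of the border-based zoom test are our assumptions for illustration:

    import numpy as np
    from collections import Counter

    def block_motion_vector(prev, curr, top, left, size=15, search=9):
        # Exhaustively minimize the cost of Eq. 12 over displacements
        # within +/- search pixels (grey-level frames assumed).
        block = prev[top:top + size, left:left + size].astype(int)
        best_cost, best = None, (0, 0)
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                y, x = top + dy, left + dx
                if y < 0 or x < 0 or y + size > curr.shape[0] or x + size > curr.shape[1]:
                    continue
                cost = np.abs(block - curr[y:y + size, x:x + size].astype(int)).sum()
                if best_cost is None or cost < best_cost:
                    best_cost, best = cost, (dx, dy)
        return best

    def is_pan(vectors, t_p=0.7):
        # Pan: the modal vector accounts for more than Tp of all vectors
        # and is non-zero (the value of Tp here is our assumption).
        mode, count = Counter(vectors).most_common(1)[0]
        return mode != (0, 0) and count / len(vectors) >= t_p

    def is_zoom(top_row, bottom_row, left_col, right_col):
        # Zoom: vertical components of the top and bottom border vectors
        # oppose each other, as do horizontal components of the left and
        # right borders (our reading of the Sect. 3.2 border test).
        v = np.mean([u[1] for u in top_row]) * np.mean([u[1] for u in bottom_row])
        h = np.mean([u[0] for u in left_col]) * np.mean([u[0] for u in right_col])
        return v < 0 and h < 0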

Our experiments show that those camera movements detected as potential transitions in Table 3 and Table 4 are correctly detected as camera pan or zoom instances. Figure 14 shows frames from a camera panning sequence (moving from up to down) and a zooming sequence (moving from close to far), respectively, that were first detected as transitions and were subsequently recognized as camera movement. This demonstrates that motion detection and analysis are effective in distinguishing camera movements from gradual transitions with a high degree of accuracy.

While identifying such distinctions was our primary purpose, we have also tried to classify camera movements over an entire video. This could provide useful information in video content analysis and in selecting representative frames for a detected segment. For instance, the end frame of a zoom can serve as the representative frame of a segment.


Our experiment consisted of applying the algorithm directly to the video package without first identifying potential transition frames. That is, the motion vector field between each pair of consecutive frames was computed and analysed to determine whether there was a camera movement during a sequence of frames. The results are summarized in Table 5. Two different skip factors were tested, and it was found that it is difficult to detect small zoom effects correctly with no skip (using every frame).

The motion of a large object in a camera shot is the major source of error for camera pan detection: two of the false detections of camera pans listed in Table 5 are due to this problem. Another one is due to erroneous motion vectors computed by the block-matching algorithm, which is inherently inaccurate when there are multiple motions inside a block. Therefore, to improve the accuracy of detecting camera movements, we need to apply a more accurate algorithm to determine motion vectors. Block-matching is particularly problematic during a zoom sequence because the area covered by the camera angle expands or contracts during a zoom. Also, combinations of pans and zooms make the detection of camera movement even more difficult. A more generalized motion classification approach is necessary, such as the global zoom/pan estimation algorithm proposed in Tse and Bakler (1991), which requires analysis of a much denser motion field - a very time-consuming process.



Fig. 13a-d. False detection of a transition: four frames from the same camera shot, detected as a transition by twin-comparison due to the movement of an airplane rudder in the shot


Fig. 14Aa-c, Ba-c. Camera panning and zooming detected by motion analysis: Aa-c three frames from a panning sequence; Ba-c three frames from a zooming sequence

Another problematic situation occurs when the camera follows a moving object. The size of the object may be such that the background or foreground changes due to the camera panning exceed the threshold Ts; the sequence is then detected as a potential transition. If the moving object is not too large, the sequence can be detected as camera panning by motion vector analysis. However, if the object covers more than half the frame, there will not be a clear mode of the motion vectors, and block matching will fail to detect camera panning. This accounts for a few of the false detections listed in Table 5. Such cases can no longer be treated as simple camera panning; rather, the problem is one of motion recovery, which needs more sophisticated analysis.



5.3 Summary and further discussion

These results demonstrate that segmentation can be performed with satisfactory accuracy. It has also been shown that, among a variety of difference metrics, comparison of colour code histograms is the optimal choice; it achieves both high accuracy and high speed. The major sources of error in detecting breaks and transitions are object motion, lighting changes, and a few inherent limitations of the segmentation processing, the effects of which need to be reduced to improve the performance of the algorithms further.

As was observed in the experiments, object motion is a major obstacle to segmentation and to the detection of camera movement. Fast motion of a single object or motion of several objects may induce a frame-to-frame difference that will be detected as a camera break, while any object motion will disrupt the visual flow required to identify pans and zooms. Distinguishing object motion from camera motion is much more difficult than distinguishing camera movement from gradual transitions, since object motions in a sequence of frames are usually not as regular as those introduced by camera movements. This problem is best solved by algorithms for object tracking. Indeed, continuous tracking of a set of objects can serve as an alternative criterion for setting segment boundaries, while the objects themselves are fundamental units for any indexing task. Therefore, the next phase of our project will involve an extensive study of object tracking algorithms. Again, optical flow analysis developed for robot navigation can be used for object tracking, though this particular analysis is still a very difficult task. Further study of optical flow analysis has been initiated as an important part of the project for both object tracking and camera movement detection.

Currently, computation of the field of motion vectors is implemented in software, but chips developed for video compression and transmission using the H.261 (Ang et al. 1991) and MPEG (Le Gall 1991) standards can be used for the same purpose. These chips compute a motion vector for every block of 8 x 8 pixels for motion-compensated interframe coding. When a video compression chip processes input from a video source, motion vectors are computed in real time, thereby increasing the efficiency of computation with much higher spatial resolution than has been achieved in software. In addition, if the video material is stored as a compressed file, we can retrieve the motion vectors directly from that file. Once a dense motion field can be obtained accurately, algorithms that recover motion from optical flow (Kasturi and Jain 1991) may also be used for camera movement detection. Such analysis may provide more accurate information about camera movement when both camera and object motion exist in the same frame sequence.

The study of object tracking and motion analysis is important not only for camera shot boundary detection but also for the classification of camera shots. Initially, the twin-comparison approach was introduced to detect gradual transitions but, as we have seen, it can also be used to detect potential sequences of camera and object motion. Such sequences can then be analysed further and classified as transitions, static camera shots, and shots with panning and zooming. Such a classification will provide useful information for content analysis of the video sequence.

If lighting changes result only in changes of pixel intensity, the problem of false detection can be avoided by using the colour histogram comparison algorithm suggested by Arman et al. (1993). This algorithm uses a two-dimensional hue and saturation (HS) histogram based on the hue, saturation and intensity (HSI) colour space. Only the hue and saturation information is used, because past studies have shown that the hue of an object surface remains the same under different lighting intensities. However, since illumination changes may involve more than intensity change, the robustness of this algorithm may be limited.
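A sketch of this idea, with an illustrative 16 x 16 quantization of hue and saturation (Python's colorsys computes hue-lightness-saturation, which we use here as an approximation of HSI):

    import colorsys
    import numpy as np

    def hs_histogram(rgb, bins=16):
        # Build a 2-D hue-saturation histogram, discarding intensity so
        # that pure lighting-intensity changes leave the histogram
        # largely unchanged.
        hist = np.zeros((bins, bins))
        height, width, _ = rgb.shape
        for y in range(height):
            for x in range(width):
                r, g, b = rgb[y, x] / 255.0
                hue, _, sat = colorsys.rgb_to_hls(r, g, b)
                hist[min(int(hue * bins), bins - 1),
                     min(int(sat * bins), bins - 1)] += 1
        return hist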

6 Conclusions and future work

As an initial effort towards the automatic partitioning of a video package into meaningful segments, techniques for detecting camera breaks and gradual transitions implemented by special effects, as well as the effects of camera movement, have been developed and implemented. The resulting algorithms have been tested on a variety of video sources. The segment boundaries detected by these algorithms have been compared with those detected by manual logging, and it has been demonstrated that the algorithms perform with satisfactory accuracy. Some future improvements were suggested in Sect. 5.3. We shall now consider several other important aspects of future work, including the development of further computer tools for video indexing.



6.1 Further enhancement of the video partitioning system

An important enhancement for video indexing is the capability to select a representative frame automatically, so that it can be used as a visual cue for each segment determined by the partitioning process. Such cues are particularly valuable if one wishes to build an index based on a content analysis of each segment. Alternatively, an array of these representative frames can itself be used as an index through which a user can browse when searching for video source material. Our current effort toward automatic selection of representative frames focuses on developing algorithms that do not require any sophisticated (or computationally intensive) attempts at content analysis. For instance, if there is a camera movement in a shot, then the end frame of the pan or zoom can be selected as the representative frame. Algorithms that find an average frame in a sequence may serve our purpose when there are no other useful features.



As more and more video material becomes available in compressed digital form, it would be advantageous to perform segmentation directly on that digital source in order to save the computational cost of decompressing every frame. This is an important aspect of our current research. In general, compression of a video frame begins with dividing each colour component of the image into a set of 8 x 8 pixel blocks (Le Gall 1991). The pixels in the blocks are then coded by the forward discrete cosine transform (DCT). The resulting 64 coefficients are then quantized and subjected to Huffman entropy encoding. This process is reversed during decompression. Since these coefficients are mathematically related to the spatial domain, they can be used to detect the sorts of changes associated with segmentation. That is, segment boundaries are detected by correlating the DCT coefficients of a set of blocks from each frame (Arman et al. 1993). Since only the differences between the key frames or intrapictures in video compressed using the MPEG, H.261 or Apple QuickTime standards contain scene change information (Le Gall 1991; Liou 1991; Apple Computer 1991), video partitioning can be based on these frames alone. Thus, the processing time for correlating DCT coefficients can be reduced further. Our initial studies have shown promising results.
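A sketch of such a compressed-domain comparison, assuming the quantized DCT coefficients of corresponding blocks have already been extracted from two frames; the normalized correlation used here is our illustrative choice, not necessarily the exact measure of Arman et al. (1993):

    import numpy as np

    def dct_correlation(coeffs_a, coeffs_b):
        # coeffs_a, coeffs_b: arrays of shape (n_blocks, 64) holding the
        # DCT coefficients of corresponding 8 x 8 blocks in two frames.
        a = coeffs_a.ravel().astype(float)
        b = coeffs_b.ravel().astype(float)
        a -= a.mean()
        b -= b.mean()
        # Values near 1 suggest the same shot; low values suggest a
        # segment boundary between the two frames.
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)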



6.2 Video indexing: challenges


As we have already observed, the sort of automatic video partitioning reported here is only the first step in our effort towards computer-assisted video indexing. The next important step is to perform content analysis on individual segments to identify appropriate index terms. This is obviously a more challenging task. More sophisticated image processing techniques will be required to provide useful tools for automatic identification of index entries and content-based video retrieval.

Object tracking and motion analysis will not only improve the performance of video segmentation and camera shot classification, as pointed out in Sect. 5.3, but will also supply the content information required for indexing. For instance, if we can extract a moving object from its background and track its motion, we should be able to construct a description of that motion that can be employed for subsequent retrieval of the camera shot. There is also the value of identifying the object once it has been extracted; but even without sophisticated identification techniques, merely constructing an icon from the extracted image may serve as a valuable visual index term.

Besides images, another important source of information in most video packages is the audio track. As any film-maker knows, the audio signal provides a very rich source of information to supplement the understanding of any video source (Metz 1985); this information may also be engaged for tasks of segmentation and indexing. For instance, significant changes in spectral content may serve as segment boundary cues. Tracking audio objects will also provide useful information for segmentation and indexing. Therefore, an effective analysis of the audio track and its integration with information obtained from image analysis will be an important part of our future work on video segmentation and indexing.

Before a video index system can be developed, it will be necessary to develop an architectural specification. If one is working with a dynamic medium like video, any index that ultimately takes the form of a printed document is likely to be of limited value. It is preferable to view the index itself as a piece of computer software through which the user may interact with the video source material. This is a vision we share with the Visual Information System project (Swanberg et al. 1993), along with their specification of an architecture that integrates databases, knowledge bases, and vision systems. We can view the index construction process as one of data insertion in a video database. The index construction system (video parser) will first use the partitioning techniques to segment the video stream into representative clips. Once those clips are generated, the knowledge module should be able to feed components of the video sample into a parser that can compare the sample to the clips. When the parser identifies a matching clip or episode, it communicates its results to the knowledge base to determine an index term that can then be assigned. That is, the knowledge module should provide the connection between the vision system and a database of index categories. To support this activity, the parser should be able both to calculate similarities between video samples and to extract a representative sample from a set of video clips. The building of such a parser system is the primary challenge we are facing, and it is the focus of our future work in developing a computer system for video indexing.

Acknowledgements. This research is carried out with support from the National Science and Technology Board of Singapore.

References

Anandan P (1989) A computational framework and an algorithm for the measurement of visual motion. Int J Comput Vision 2:283-310

Ang PH, Ruetz PA, Auld D (1991) Video compression makes big gains. IEEE Spectrum 28:16-19

Apple Computer (1991) QuickTime developer's guide. Cupertino
Arman F, Hsu A, Chiu M-Y (1992) Feature management for large video databases. Proc SPIE Conf on Storage and Retrieval for Image and Video Databases, San Diego

Bordwell D, Thompson K (1993) Film art: an introduction. McGraw-Hill, New York

Horn BKP, Schunck BG (1981) Determining optical flow. Artif Intell 17:185-203

Kasturi R, Jain R (1991) Dynamic vision. In: Kasturi R, Jain R (eds) Computer vision: principles. IEEE Computer Society Press, Washington, pp 469-480



Le Gall D (1991) MPEG: a video compression standard for multimedia applications. Commun ACM 34:47-58

Liou M (1991) Overview of the px64 kbit/s video coding standard. Commun ACM 34:59-63

Mackay W, Davenport G (1989) Virtual video editing in interactive multimedia applications. Commun ACM 32:802-810

Metz C (1985) Aural objects. In: Weis E, Belton J (eds) Film sound: theory and practice. Columbia University Press, New York, pp 154-161
Nagasaka A, Tanaka Y (1991) Automatic video indexing and full-video search for object appearances. Proc 2nd Working Conf on Visual Database Systems, pp 119-133

Netravali AN, Haskell BG (1988) Digital pictures: representation and compression. Plenum, New York

Rosenfeld A, Kak AC (1982) Segmentation. In: Digital picture processing, 2nd edn. Academic Press, New York, pp 57-190
Ruan LQ, Smoliar SW, Kankanhalli A (1992) An analysis of low-resolution segmentation techniques for animate video. Proc ICARCV '92, I:CV-16.3.1-16.3.5

Swanberg D, Shu C-F, Jain R (1992) Knowledge guided parsing in video databases. Proc IS&T/SPIE Symp Electronic Imaging: Sci Tech, San Jose

Tonomura Y (1991) Video handling based on structured information for hypermedia systems. Proc Int Conf Multimedia Information Syst, Singapore, pp 333-344
Tse YT, Bakler RL (1991) Global zoom/pan estimation and compensation for video compression. Proc ICASSP '91, pp 2725-2728

Zhang HJ, Kankanhalli A, Smoliar SW (1993) Automatic video partitioning and indexing. Proc IFAC '93, to appear
Zhang HJ, Kankanhalli A, Smoliar SW (1993) Detecting camera breaks in full-motion video. Available from the authors upon request

HONGJIANG ZHANG obtained his Ph.D. from the Technical University of Denmark in 1991 and his B.Sc. from ZhengZhou University, China, in 1981, both in electrical engineering. He had also worked as an engineer at the Shijiazhuang Communications Institute, China, before his Ph.D. work. He joined the Institute of Systems Science at the National University of Singapore in December 1991 and is presently working on projects on video classification and moving object tracking and classification. His current research interests include image processing, computer vision, multimedia, digital video, and remote sensing.

ATREYI KANKANHALLI obtained the B.Tech. degree in electrical engineering from the Indian Institute of Technology, Delhi, in 1986 and the M.S. degree in electrical engineering from Rensselaer Polytechnic Institute, Troy, NY, in 1987. From 1988 to 1989, she was a graduate research assistant in the Biomedical Engineering Department at RPI. Since 1990, she has been with the Institute of Systems Science, National University of Singapore, where she is working in the areas of image segmentation and object tracking.

STEPHEN WILLIAM SMOLIAR obtained his Ph.D. in applied mathematics and his B.Sc. in mathematics from MIT. He has taught computer science at both the Technion, Israel, and the University of Pennsylvania. He has worked on problems involving specification of distributed systems at General Research Corporation and has investigated expert systems development at both Schlumberger and the Information Sciences Institute (University of Southern California). He is currently interested in applying artificial intelligence to advanced media technologies. He joined the Institute of Systems Science at the National University of Singapore in May 1991 and is presently leading a project on video classification.