
Int J Multimed Info Retr (2014) 3:1–14
DOI 10.1007/s13735-013-0049-1

REGULAR PAPER

Self-similarity-based partial near-duplicate video retrieval and alignment

Zhipeng Wu · Kiyoharu Aizawa

Received: 10 September 2013 / Revised: 16 November 2013 / Accepted: 4 December 2013 / Published online: 22 December 2013
© Springer-Verlag London 2013

Abstract There have been recent studies on partial near-duplicate videos, in which only segments of the videos are near duplicates of each other. State-of-the-art search schemes usually segment the input video into clips and perform clip-level near-duplicate retrieval. However, the segmentation results are often poorly aligned, which leads to a difficult "unbalance" problem. In this paper, we introduce a self-similarity-based feature representation called the Self-Similarity Belt (SSBelt), which derives from the Self-Similarity Matrix (SSM). In addition, a distinctive pattern in the SSBelt called the Interest Corner is detected and described by a bag-of-words representation. The visual words are then combined into visual shingles and indexed by an inverted file index for fast retrieval. Another important task is to accurately align the unbalanced clips, for which we propose the Intensity Mark (IMark) and design a coarse-to-fine near-duplicate video localization scheme. Experimental results show the effectiveness of our approach on both web-based near-duplicate video and unbalanced video datasets. The near-duplicate alignment capacity of IMark is also shown to be effective.

Keywords Self-Similarity Matrix · Partial near-duplicate video retrieval · Self-Similarity Belt · Intensity Mark · Visual shingle

Z. Wu (B) · K. Aizawa
Department of Information and Communication Engineering, The University of Tokyo, Tokyo, Japan
e-mail: [email protected]

K. Aizawa
e-mail: [email protected]

1 Introduction

Nowadays, with the exponential growth of digital video sources such as video-sharing websites and online TV broadcasting, the high level of redundancy caused by overlapping or duplicate videos greatly degrades the user experience. As Wu et al. [1] have reported, users upload 65,000 new videos to YouTube per day, and an average of 27 % of the videos returned by a sample of 24 popular queries were found to be redundant. As a consequence, near-duplicate video retrieval has become a hot research topic.

In general, near-duplicate videos are defined as identical or approximately identical videos that may differ by various transformations, such as photometric variations, editing operations, and certain modifications [1,2]. In [3], Tan et al. classified them into two categories: (1) fully near-duplicate videos and (2) partial near-duplicate videos. Partial near-duplicate videos are those in which only certain segments of the videos are near-duplicates of each other [3]. Tracing back to the cause of partial-duplicate videos, many were found to have been edited manually for specific reasons. Figure 1 illustrates an example, where the human-edited video "Cinema 2010" contains near-duplicate segments of the original movies. Partial-duplicate videos can also be found in news broadcasts and web videos such as Internet Kuso-Movies.

To detect and retrieve partial near-duplicates effectively, the videos are usually segmented into finer units such as scenes and shots, and the retrieval task is then applied at the sub-clip level. In this case, the sub-problem of near-duplicate video clip matching has been brought to the forefront. However, because of various video transformations and human-editing factors, there exists


Fig. 1 Example of partial near-duplicate videos

an "unbalance" problem between near-duplicate video clips, which can be characterized as follows:

1. The near-duplicate clips are poorly aligned, which means that their start positions do not coincide exactly. This may be because of automatic clip segmentation failure under severe transformations or because of arbitrary cutting by the video producer.

2. For the same reason, because of scene/shot detection failure or arbitrary human factors, the near-duplicates differ in duration, which increases the retrieval difficulty.

We use the word "unbalance" to describe the existence of poorly aligned clips after video segmentation in partial near-duplicate video retrieval. It differs from the partial near-duplicate issue in that most of the unbalanced clips are relatively "pure", having come from only one source video instead of many. A pair of unbalanced videos have different lengths and different start/end positions, and their visual content may be changed by various transformations. In this paper, we aim to construct an effective partial near-duplicate video retrieval scheme that copes well with unbalanced data. The proposed method should be robust with respect to clip segmentation (scene/shot detection) failure. In addition, it should be succinct and fast for web-scale data processing and should not require excessive memory.

The remainder of the paper is organized as follows: Sect. 2 reviews recently proposed methods for near-duplicate video retrieval. In Sect. 3, we introduce the SSBelt representation for video clips. SSBelt-based feature extraction and indexing is described in Sect. 4. Section 5 proposes a novel solution for the task of near-duplicate alignment. Section 6 reports the experimental evaluation, and we conclude the paper in Sect. 7.

2 Related work

Video is a collection of sequential frames (images). By using representative images (keyframes), the problem of near-duplicate video retrieval can be simplified into near-duplicate keyframe retrieval, thereby benefiting from well-developed image retrieval systems and methods. In [4], Zhang et al. presented a part-based image-similarity measure derived from the stochastic matching of Attribute Relational Graphs, which represent images in terms of their parts and part relations. To build robust representations for images, researchers adopted the DoG local-interest-point detector [5] and PCA-SIFT as the descriptor [6]. In [7], Zhao et al. further refined the local-interest-point matching step by proposing a One-to-One Symmetric (OOS) matching algorithm. Because of its capacity to exclude false interest-point matches, the OOS algorithm was found to be highly reliable for near-duplicate keyframe identification. Considering possible local deformations and the spatial coherence between two point sets, Zhu et al. [8] introduced non-rigid image matching. Liu et al. [9] contributed another local descriptor for large-scale near-duplicate detection, the Gradient Ordinal Signature (GOS), which was shown to offer low dimensionality, simple computation, and description power comparable with classical descriptors (SIFT, PCA-SIFT, SURF). Song et al. [10] extracted multiple features and proposed a new hashing algorithm that learns hash codes and a group of hash functions. Large modern image and video databases require good scalability, and fast query processing is necessary. Therefore, Chum et al. [11] proposed and compared two retrieval schemes using Locality-Sensitive Hashing and Min-Hashing. In [12], Xu et al. proposed a multi-level spatial matching framework that copes well with spatial shifts and scale variations for image-based near-duplicate identification. Zheng et al. [13] presented a temporal, semantic, and visual partitioning model to divide a video corpus into small overlapping partitions; the near-duplicate keyframes detected in each partition were linked up via transitivity propagation. Another detection framework based on frame-similarity search and frame fusion was introduced in [14], where Wei et al. proposed a Viterbi-like frame fusion algorithm comprising an online back-tracking strategy with three relaxed constraints. In Min et al. [15], high-level semantic concepts were also extracted from video frames to facilitate near-duplicate matching.

Although methods based on keyframes benefit from the adoption of local interest points, the following problems appear: (1) computation-intensive detection and matching of local features, which might be unaffordable for web-scale data processing, (2) false matching of features, which degrades the performance of the retrieval system, and (3) no guarantee of the stability and repeatability of keyframe extraction.


We still need a method that is robust to the significant visual differences between keyframes extracted from near-duplicate videos.

Recently, several non-keyframe-based methods have been presented that have achieved good experimental results. Zhou et al. [16] embedded temporal information about the video streams by introducing a novel Video Cuboid Signature to describe video segments. This is a 3D cuboid whose construction is based on spatially and temporally adjacent pixels. In [17–19], a series of dimension-reduction-based approaches were presented with the basic idea of "representing the 3D video volume as a 2D-image-like description". Wu et al. introduced the Self-Similarity Matrix (SSM) representation for video data. The SSM is constructed by exhaustively calculating the distances between frame pairs [17]. By using self-similarity rather than the visual appearance of the frames, the near-duplicate retrieval scheme copes well with severe photometric variations such as blur, contrast change, and monochrome [20]. Zhang et al. [19] proposed the Interest Seam Image as an efficient visual synopsis for video. First, an optimal vertical seam that encompasses the highest energy is identified for each frame. Then, by arranging each seam as one column, the seam image is constructed with a width equal to the original video clip length. Figure 2 shows an example of a seam image.

However, some of these "2D image description"-based methods are not robust to the above-mentioned "unbalance" problem. Consider the example of the Interest Seam Image. Figure 2 illustrates two videos from the MUSCLE-VCD-2007 corpus [21].

Fig. 2 Interest seam image. When the automatic video segmentation algorithm cannot accurately align the near-duplicate clips, the generated seam images differ greatly

Video 2 is produced from Source Video 1 by camcording and adding subtitles. Because of the severe transformation, the automatic video segmentation algorithm fails to align the clip boundaries. As described in [19], the position of the vertical seam is defined in terms of the spatiotemporal energy map calculated from the whole clip volume. The difference in clip length directly results in different seam positions (Fig. 2 a-1, a-2). Moreover, the sizes of the constructed seam images vary as the clip length changes (Fig. 2 b-1, b-2). Therefore, to establish a reliable near-duplicate video retrieval system, the unbalance problem must be solved first.

In this paper, we focus on solving the "unbalance" problem, in addition to achieving fast video description and indexing. Motivated by the SSM (Self-Similarity Matrix)-based video representation [17,20], a succinct version called the Self-Similarity Belt (SSBelt) is proposed. By analyzing the distinctive texture pattern of the SSBelt, we put forward a corresponding local feature extraction and indexing scheme. Moreover, to improve the retrieval accuracy and to implement precise near-duplicate alignment (finding the exact start/end-time codes for near-duplicate clips), we further propose a coarse-to-fine post-verification algorithm based on fast histogram matching [22] and the Intensity Mark (IMark). The details of our approach are described in Sects. 3, 4, and 5.

3 SSBelt

3.1 SSM representation

Given an n-frame clip V = ⟨F1, F2, ..., Fn⟩, the SSM is built by exhaustively calculating the similarities between frames:

\mathrm{SSM}_{n \times n} =
\begin{bmatrix}
1       & S_{1,2} & S_{1,3} & \cdots & S_{1,n} \\
S_{2,1} & 1       & S_{2,3} & \cdots & S_{2,n} \\
\vdots  & \vdots  & \vdots  & \ddots & \vdots  \\
S_{n,1} & S_{n,2} & S_{n,3} & \cdots & 1
\end{bmatrix}
\qquad (1)

where S_{u,v} is the similarity between frames u and v. The SSM is a symmetric positive semi-definite matrix with value 1 (totally similar) along the main diagonal.

Compared to features based on the frames' visual appearance, the SSM embeds the temporal information about the frame sequence and organizes it in a 2D matrix format. Because of the transformations present in near-duplicate videos (photometric variations, editing operations), visual-appearance-based features may lose their discriminative power in certain situations. However, because the temporal similarities between sequential frames are more robust to visual deformations, the SSM depicts the content of near-duplicate videos from a different perspective.


Fig. 3 Comparison between visual-appearance-based features and SSM

Figure 3 shows a basic comparison between visual-appearance-based features and a self-similarity-based feature. The original video in the center is from the CC_WEB_VIDEO corpus [23]. It has been transformed into four near-duplicate versions by blur, contrast change, crop, and horizontal flip. Here, we extract both global features (color histogram, ordinal intensity signature [24]) and local features (SIFT matching [5]) from corresponding frames and construct the SSM for the whole video clip. It is clear that local features show better near-duplicate matching quality than global features. However, false matches and undetected interest points degrade their performance somewhat. According to Fig. 3, the striking similarities between the SSMs demonstrate the rationality and availability of using self-similarity-based features in near-duplicate retrieval.

In [17,20], the SSM is constructed from optical flow features extracted between adjacent frames, using cosine distance as the similarity metric. However, extracting the optical flow requires substantial computation, and the distances between feature vectors are double-precision floating-point values ranging from 0 to 1, which demand large storage. To implement fast feature extraction and to reduce the space complexity in a modern web-scale video retrieval system, the Binary Spatiotemporal Feature of [25] is introduced into our approach to represent each frame as an 8-bit binary signature.

The Binary Spatiotemporal Feature is inspired by the ordinal measure [26]. In [25], the frame is partitioned into 3 × 3 blocks (G[1,1], ..., G[3,3] in Fig. 4a) and the average intensity value is computed for each block.

Fig. 4 CE-based features and LBP-based features

Because the partitioned blocks may vary simply due to the presence of a black border, any frame border is usually removed as a preprocessing step [18]. The rank orders of the blocks' average intensities are then combined into a 9-dimensional ordinal-intensity signature. Based on this signature, two 8-bit features


called the CE-based feature and the LBP-based feature [25] are proposed.

CE refers to Conditional Entropy. By investigating the top eight informative ordinal relations, ranked according to an entropy-based selection criterion, 8-bit binary signatures are constructed as shown in Fig. 4b. Figure 4c illustrates the LBP-based feature, whose first four bits and last four bits are calculated by a central mapping function and a marginal mapping function, respectively. Figure 4b, d summarize the construction functions for the CE- and LBP-based features. Details can be found in [25].

It should be noted that the original CE/LBP-based features are not robust to certain transformations such as horizontal/vertical flip. However, if we extract the self-similarities between frame signatures, the patterns remain unchanged. After generating the 8-bit binary signatures from the frames, the SSM is constructed by exhaustively calculating the pair-wise similarities.

The similarity between two 8-bit signatures p and q is defined in terms of their Hamming similarity. In other words, we count the number of zeros in p ⊕ q. Noticing that the similarity falls into one of the bins [0, 1, 2, 3, 4, 5, 6, 7, 8], we combine 0 and 1 into the same bin, enabling the similarity value to be expressed compactly in one byte. Compared to the optical-flow-based SSM extraction scheme, the proposed CE/LBP-based method is fast with respect to both feature extraction and similarity calculation. Moreover, the storage cost for each matrix element is reduced from an eight-byte double-precision floating-point value to one byte.
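As a concrete illustration of this compact construction, the following Python sketch computes the Hamming-based signature similarity and assembles the byte-valued SSM of Eq. 1. It assumes the per-frame 8-bit CE/LBP signatures have already been extracted; the function names are our own.

```python
import numpy as np

def signature_similarity(p: int, q: int) -> int:
    """Hamming similarity of two 8-bit signatures: number of equal bits.

    Bins 0 and 1 are merged as described above, so the value fits into
    one byte with eight possible bins {1, ..., 8}.
    """
    equal_bits = 8 - bin((p ^ q) & 0xFF).count("1")
    return max(equal_bits, 1)          # merge bins 0 and 1

def build_ssm(signatures: list[int]) -> np.ndarray:
    """Compact SSM (Eq. 1) over per-frame 8-bit signatures, one byte per cell.

    The diagonal holds the maximum similarity (8 in this integer encoding,
    corresponding to the normalized value 1 in Eq. 1).
    """
    n = len(signatures)
    ssm = np.empty((n, n), dtype=np.uint8)
    for i in range(n):
        for j in range(i, n):
            s = signature_similarity(signatures[i], signatures[j])
            ssm[i, j] = ssm[j, i] = s  # symmetric matrix
    return ssm
```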

3.2 From matrix to belt

Although we presented a compact version of the SSM in Sect. 3.1, the SSM itself is highly redundant: (1) the SSM is a symmetric matrix, so the upper (lower) triangular part is sufficient to represent the whole video clip; and (2) related research on SSM-based approaches places special emphasis on the area around the main diagonal [17,27,28]. Let SDiag(d) denote the dth super-diagonal. The matrix elements S(i, j) in SDiag(d) satisfy j = i + d, i.e., they are the similarity values between frame F_i and its dth subsequent frame. Intuitively, as the parameter d increases, the probability that point S(i, j) also exists in the near-duplicate clip's SSM decreases. This happens when the automatic video segmentation algorithm splits the near-duplicate clip into two or more segments because of severe visual deformations. Moreover, as d increases, the size of the SSM grows as O(d²), which may result in unacceptable processing time and storage costs.

Noticing the physical significance of the super-diagonals, we present a novel self-similarity-based representation called the SSBelt (Self-Similarity Belt). The SSBelt is motivated by the original SSM representation and the use of super-diagonals with parameter d.

Fig. 5 SSM and SSBelt

Given an n-frame video clip V = ⟨F1, F2, ..., Fn⟩, the SSBelt is defined as a d × (n − d) matrix:

\mathrm{SSBelt}_{d \times (n-d)} =
\begin{bmatrix}
B_{1,1} & B_{1,2} & B_{1,3} & \cdots & B_{1,n-d} \\
B_{2,1} & B_{2,2} & B_{2,3} & \cdots & B_{2,n-d} \\
\vdots  & \vdots  & \vdots  & \ddots & \vdots    \\
B_{d,1} & B_{d,2} & B_{d,3} & \cdots & B_{d,n-d}
\end{bmatrix}
\qquad (2)

B_{i,j} = \mathrm{Similarity}(F_j, F_{j+i}) \qquad (3)

The element B_{i,j} is the similarity between frames j and j + i (Eq. 3). Row k of the SSBelt is simply the kth super-diagonal of the SSM. Figure 5 illustrates a video clip from the corpus [29], its generated SSM, and the SSBelt with parameter d = 50. The SSBelt corresponds to the yellow area of the SSM. Compared with a square matrix, the proposed "Belt" representation reduces the n × n similarity calculation and storage complexity to d × (n − d). For example, the duration of the video clip in Fig. 5 is 3 min; with a sample rate of five frames per second, the SSBelt is only 1/19 of the SSM size.
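The belt can be built directly from the per-frame signatures without ever forming the full SSM. The sketch below is a minimal illustration of Eqs. 2–3 under that assumption; the similarity argument stands for any pairwise measure, e.g., the signature_similarity helper sketched earlier.

```python
import numpy as np

def build_ssbelt(signatures, d, similarity):
    """SSBelt of Eqs. 2-3: row i-1 holds the i-th super-diagonal of the SSM.

    signatures: per-frame 8-bit signatures.
    similarity: pairwise measure, e.g. signature_similarity above.
    Only d x (n - d) similarities are computed instead of n x n.
    """
    n = len(signatures)
    belt = np.empty((d, n - d), dtype=np.uint8)
    for i in range(1, d + 1):        # frame offset
        for j in range(n - d):       # 0-based index of the earlier frame
            belt[i - 1, j] = similarity(signatures[j], signatures[j + i])
    return belt
```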

4 Feature extraction and indexing

The proposed SSBelt provides compact feature organization as well as a 2D-image-like video representation. Intuitively, after extracting the SSBelt, we can use image retrieval techniques to match and retrieve SSBelts generated by near-duplicate clips. At this point, the problem turns into one of "partial-duplicate image retrieval". Figure 6 illustrates local-interest-point-based SSBelt matching, where the unbalanced near-duplicate video is deformed from the original source by caption insertion. We use Lowe's SIFT detector and descriptor [5].

Although the above SIFT-based method provides a solution for SSBelt matching, the long processing time and mismatches of keypoints leave room for improvement. In this section, by analyzing the distinctive texture pattern of the SSBelt and its physical significance, we propose the Interest Corner as a local feature particularly designed for the SSBelt.


Fig. 6 Keypoint matching for SSBelt

Fig. 7 Physical significance of the Interest Corner

4.1 Interest Corner

It is clear that the edges of SSBelts run in two directions: 90° vertical lines and 45° upward straight lines. Furthermore, most of the vertical edges and upward edges conjoin at a specific point in the first row of the SSBelt and form a "corner". Suppose that such a corner point is located at B_{1,k}. The corresponding vertical edge points satisfy ⟨B_{u,k} | u = 1, 2, 3, ..., d⟩, and the upward edge points satisfy ⟨B_{v,k+1−v} | v = 1, 2, 3, ..., d⟩. In terms of Eq. 3, the vertical edge reflects the similarities between frame F_k and the frame set ⟨F_{k+1}, F_{k+2}, ..., F_{k+d}⟩, while the upward edge depicts the similarities between the frame set ⟨F_{k−d+1}, F_{k−d+2}, ..., F_k⟩ and frame F_{k+1}. Therefore, the frame sequence can be cut before F_{k+1} and formed into two sets. As shown in Fig. 7, the lower the visual similarities between frames from the sets ⟨F_{k−d+1}, ..., F_k⟩ and ⟨F_{k+1}, ..., F_{k+d}⟩, the more obvious is the corner (Interest Corner) generated in the first row of the SSBelt.

The Interest Corner corresponds to the notion of a "visual boundary" and similar terms that describe a change in frame content. Compared with flat areas, complex textures such as edges and corners always denote visual appearance changes. Based on the Interest Corners, we aim to detect the active areas of the SSBelt and thereby implement fast retrieval.

4.1.1 Interest Corner detector

The proposed detector derives from classical edge and line detection schemes. To extract Interest Corners, the edges of the input SSBelt are first detected. Upward and vertical edge lines could then be extracted using the Hough transform. In this paper, we present an efficient method for localizing Interest Corners based on a simplified Hough transform.

The parametric equation of a line in the matrix can be written as:

Column − 1 = a × (Row − 1) + (b − 1)    (4)

where (Row, Column) is a point's location in the matrix, and a, b are the parameters describing a straight line. For a 45° upward line, a = −1, and the parameter b is defined by:

Column + Row − 1 = b    (5)

which means that all the points satisfying Eq. 5 form an upward line through the Interest Corner (Row, Column) = (1, b). For a 90° vertical line, a = 0, and the parameter b is defined by:

Column = b    (6)

which means that all the points satisfying Eq. 6 form a vertical line through the Interest Corner (Row, Column) = (1, b).

Given a d × (n − d) SSBelt, we build two cumulative frequency vectors by traversing all the matrix pixels (Algorithm 1). Because Interest Corners are located at the conjunction of vertical and upward edges, we set a threshold on the cumulative frequency vectors and employ non-maximum suppression to restrain noise. Figure 8 illustrates an example of Interest Corner detection.

Algorithm 1: Cumulative frequency
Input: E_edge – edge map produced by image edge detection
Output: CF_45, CF_90 – cumulative frequency vectors
1  foreach pixel e(row, col) in E_edge do
2      if e(row, col) is an edge point then
3          ++CF_45[row + col − 1];   // votes for 45° upward lines (Eq. 5)
4          ++CF_90[col];             // votes for 90° vertical lines (Eq. 6)
5      end
6  end
7  return CF_45, CF_90

Image filtering for edge detection requires O(d × n) complexity. Traversing the edge map to calculate the cumulative frequency vectors also needs O(d × n). In all, the proposed Interest Corner detector implements fast local feature detection with O(d × n) complexity.
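The following Python sketch illustrates how the cumulative-frequency voting of Algorithm 1, the threshold, and the non-maximum suppression could fit together. It assumes a binary edge map from any standard edge detector; the threshold ratio and suppression window are illustrative parameters, not values from the paper.

```python
import numpy as np

def detect_interest_corners(edge_map, thresh_ratio=0.6, nms_window=3):
    """Sketch of Interest Corner detection based on Algorithm 1.

    edge_map: binary d x (n - d) array from any standard edge detector.
    Returns 0-based column indices b of corners at (row, col) = (1, b).
    """
    d, w = edge_map.shape
    cf45 = np.zeros(w + d, dtype=np.int32)   # votes for 45-deg upward lines (Eq. 5)
    cf90 = np.zeros(w, dtype=np.int32)       # votes for 90-deg vertical lines (Eq. 6)
    rows, cols = np.nonzero(edge_map)
    np.add.at(cf45, rows + cols, 1)          # 0-based form of row + col - 1 = b
    np.add.at(cf90, cols, 1)

    # A corner needs strong support from both an upward and a vertical edge.
    threshold = thresh_ratio * d
    candidates = [b for b in range(w) if cf45[b] > threshold and cf90[b] > threshold]

    # Non-maximum suppression on the combined vote strength.
    corners = []
    for b in sorted(candidates, key=lambda b: -(cf45[b] + cf90[b])):
        if all(abs(b - c) > nms_window for c in corners):
            corners.append(b)
    return sorted(corners)
```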


Fig. 8 Interest Corner detection

Fig. 9 CS-LBP descriptor

4.1.2 Interest Corner descriptor

The region used to represent the Interest Corner's local texture pattern is fixed to an approximately square block of (d + 1) × d pixels, with the column containing the Interest Corner (Row, Column) = (1, b) located as the middle column. We employ the CS-LBP descriptor [30], which is faster than and performs favorably compared to the SIFT descriptor.

CS-LBP stands for Center-Symmetric Local Binary Patterns. In our implementation, the input region is first divided into 2 × 2 Cartesian blocks (Fig. 9a) and a 16-bin CS-LBP histogram is built for each block. In all, the concatenated descriptor is 64-dimensional. The calculation of CS-LBP is specified in Eq. 7:

CS(x, y) = \sum_{i=0}^{3} s(n_i - n_{i+4})\, 2^i, \qquad
s(x) =
\begin{cases}
1 & x > 0 \\
0 & \text{otherwise}
\end{cases}
\qquad (7)

where n_i and n_{i+4} correspond to the intensity values of center-symmetric pixel pairs around pixel (x, y). Figure 9b shows the CS-LBP descriptor. More details about the descriptor can be found in [30].
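A minimal sketch of the 64-dimensional Interest Corner descriptor follows, assuming the (d + 1) × d region has already been cropped around the detected corner. The neighbor ordering inside the 3 × 3 neighborhood is an assumption made for illustration; only the overall structure (2 × 2 blocks, 16-bin CS-LBP histograms, concatenation) follows the description above.

```python
import numpy as np

def cs_lbp_code(patch3x3: np.ndarray) -> int:
    """CS-LBP code of Eq. 7 for the 3x3 neighborhood around a pixel.

    Neighbor ordering (assumed for illustration): n0..n7 go counter-clockwise
    starting from the right neighbor, so (n_i, n_{i+4}) are center-symmetric.
    """
    c = patch3x3
    n = [c[1, 2], c[0, 2], c[0, 1], c[0, 0], c[1, 0], c[2, 0], c[2, 1], c[2, 2]]
    return sum((1 << i) for i in range(4) if n[i] > n[i + 4])

def cs_lbp_descriptor(region: np.ndarray) -> np.ndarray:
    """64-d Interest Corner descriptor: 2x2 blocks, 16-bin CS-LBP histogram each."""
    h, w = region.shape
    hists = []
    for bi in range(2):
        for bj in range(2):
            block = region[bi * h // 2:(bi + 1) * h // 2,
                           bj * w // 2:(bj + 1) * w // 2]
            hist = np.zeros(16, dtype=np.float32)
            for y in range(1, block.shape[0] - 1):
                for x in range(1, block.shape[1] - 1):
                    hist[cs_lbp_code(block[y - 1:y + 2, x - 1:x + 2])] += 1
            hists.append(hist / max(hist.sum(), 1.0))  # per-block normalization
    return np.concatenate(hists)                       # 4 x 16 = 64 dimensions
```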

4.2 Indexing and retrieval by visual shingle

Modern large-scale databases benefit from the adoption of the bag-of-words representation and inverted file indexing [31]. In our approach, the 64-dimensional Interest Corner descriptor is quantized into visual words based on hierarchical k-means clustering [32]. This hierarchical method organizes the cluster centers in a tree structure, thereby greatly reducing the processing time for descriptor quantization.

An inverted file is structured similarly to an ideal book index: it has an entry (hit list) for each word, where all occurrences of the word in all documents are stored [31]. In our case, each entry in the inverted index is a "visual shingle", which is a binding of neighboring visual words [25].

The idea of the visual shingle derives from the w-shingling scheme of text retrieval [33]. To improve the discriminative power of visual words, we bind two neighboring visual words (VW_former, VW_rear) that share the same contextual information into a visual shingle (VS). In this case, given a visual word vocabulary of size K, the number of constructed visual shingles is K² and the shingle ID can be generated accordingly. We propose the visual shingle structure specified as follows:

(VW_former, VW_rear, TimeGap)

This can be further simplified to:

(VSID, TimeGap)

TimeGap is the time interval between the pair of neighboring visual words. It directly embeds temporal information into the indexing structure and greatly decreases the false positive rate. Moreover, by sorting the visual shingles according to TimeGap, fast entry traversal is implemented for a query shingle. As Fig. 10 shows, the data are sorted by TimeGap. Given a query (id1, tg1), we need to consider only the shingles whose TimeGap tg is constrained by (tg1 − e) < tg < (tg1 + e).

Because the TimeGaps are sorted in ascending order, the entry traversal stops when tg > (tg1 + e). Here, e is controlled by the user.
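A possible realization of such an inverted file is sketched below: one posting list per shingle ID, kept sorted by TimeGap so that a query scans only the (tg1 − e, tg1 + e) range and stops early. The class and method names are our own.

```python
import bisect
from collections import defaultdict

class ShingleIndex:
    """Sketch of the inverted file: one posting list per visual shingle ID,
    each sorted by TimeGap so a query only scans the (tg1 - e, tg1 + e) range."""

    def __init__(self):
        self.postings = defaultdict(list)   # vs_id -> sorted list of (time_gap, clip_id)

    def add(self, vs_id: int, time_gap: float, clip_id: int) -> None:
        bisect.insort(self.postings[vs_id], (time_gap, clip_id))

    def query(self, vs_id: int, tg1: float, e: float) -> list:
        entries = self.postings.get(vs_id, [])
        lo = bisect.bisect_right(entries, (tg1 - e, float("inf")))
        hits = []
        for tg, clip_id in entries[lo:]:
            if tg >= tg1 + e:               # sorted order lets us stop early
                break
            hits.append(clip_id)
        return hits
```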

Although the binding of visual words into visual shingles promotes discriminative power, it also decreases the robustness against Interest Corner detection failure. Noticing this fact, we introduce a duplicate visual shingle generation strategy, in which visual shingles are generated by a visual word and its 1st/2nd/3rd nearest neighbors. With reference to Fig. 11, suppose we have detected five visual words in an SSBelt. Our scheme generates four visual shingles based on the 1st nearest neighbor, three visual shingles based on the 2nd nearest neighbor, and two visual shingles based on the 3rd nearest neighbor.


Fig. 10 Inverted indexing for visual shingles

Fig. 11 Duplicate visual shingle generation

Although the duplicate visual shingle generation strategy requires much more processing time and storage space, it is a reasonable trade-off between robustness and efficiency.
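The sketch below illustrates the duplicate-shingle generation, assuming each detected Interest Corner has been quantized to a (visual word, timestamp) pair and that "nearest neighbors" are the temporally following words, as in Fig. 11; the shingle-ID encoding is one simple possibility.

```python
def generate_shingles(words, vocab_size, max_neighbor=3):
    """Duplicate visual shingle generation in the spirit of Fig. 11.

    words: list of (visual_word_id, timestamp) pairs in temporal order
           (one pair per detected Interest Corner).
    Each word is bound to its 1st..max_neighbor-th following word, so five
    words yield 4 + 3 + 2 = 9 shingles when max_neighbor = 3.
    Returns (vs_id, time_gap) tuples with vs_id = former * vocab_size + rear.
    """
    shingles = []
    for i, (former, t_former) in enumerate(words):
        for step in range(1, max_neighbor + 1):
            if i + step >= len(words):
                break
            rear, t_rear = words[i + step]
            shingles.append((former * vocab_size + rear, t_rear - t_former))
    return shingles
```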

Retrieval by the inverted file index is formulated as a voting problem, whereby each visual shingle in the query SSBelt votes for its matched SSBelt files in the database. After searching the index, the candidate set of near-duplicate clips is returned to the user. In the next section, we introduce a post-verification scheme to re-rank the results and implement near-duplicate alignment.

5 Near-duplicate alignment

The post-verification step in our approach addresses the task of near-duplicate alignment, which aims to find near-duplicate video clips as well as localize the exact start/end-time codes of the near-duplicate area. It is also referred to as near-duplicate localization. In this paper, post-verification is implemented via a coarse-to-fine mechanism.

5.1 Coarse verification by histogram matching

Given two video clips p and q and their corresponding SSBelts S_p and S_q, the two clips usually differ greatly with

respect to duration, because of false positive samples and the "unbalance" problem. One basic idea for fast verification is to use a simple global feature such as a gray-scale histogram on S_p and S_q. However, since the sizes of the SSBelts differ, sliding-window matching is required. In this case, although there are only eight histogram bins, exhaustively traversing the sub-window regions is time-consuming. We therefore introduce a fast histogram matching algorithm specially designed for unbalanced data [22]. If S_p measures d × n and S_q measures d × N (n < N), the algorithm reduces the sliding window's complexity from O(d × n × N) to O(d × N).

For normalized 8-bin histograms h_p and h_q, we define their similarity using the expected likelihood kernel:

K(h_p, h_q) = \sum_{i=1}^{8} h_p(i)\, h_q(i)
            = \sum_{i=1}^{8} h_p(i) \left( \frac{1}{|W|} \sum_{x \in W} \delta[b(x) - i] \right)
            = \frac{1}{|W|} \sum_{x \in W} \sum_{i=1}^{8} h_p(i)\, \delta[b(x) - i]
            = \frac{1}{|W|} \sum_{x \in W} h_p(b(x))
\qquad (8)

where δ[n] is the Kronecker delta, with δ[n] = 1 for n = 0 and δ[n] = 0 otherwise, and b(x) is the intensity value of pixel x in region W. Given S_p and a sub-window region W from S_q, Eq. 8 indicates that we can use S_p's pre-calculated histogram h_p to initialize region W. The average value over W is then the expected likelihood kernel histogram similarity. The sum of values h_p(b(x)) can be computed efficiently by an integral image [34], as shown in Algorithm 2.

The similarity set ⟨Sim_i⟩ acts as a filter for definitely rejected sub-windows.


Algorithm 2: Fast histogram-based verification
Input: S_p – smaller SSBelt; S_q – larger SSBelt; ⟨W_i⟩ – set of sub-windows
Output: ⟨Sim_i⟩ – set of sub-window similarities
1  h_p = NormalizedHistogram(S_p);
2  foreach pixel x in S_q do
3      b(x) ← h_p(b(x));
4  end
5  I_int = CreateIntegralImage(S_q);
6  return ⟨Sim_i⟩ = ExtractSubwindowAverage(I_int, ⟨W_i⟩);

By eliminating low-similarity window positions, the required number of careful comparisons is greatly reduced.
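The following sketch puts Eq. 8 and Algorithm 2 together: each pixel of the larger belt is replaced by h_p(b(x)), an integral image is built once, and every full-height sub-window similarity is then a constant-time box average. The interface (column-range windows, belt values in 1–8) reflects our reading of the algorithm rather than a reference implementation.

```python
import numpy as np

def fast_histogram_verification(belt_p: np.ndarray, belt_q: np.ndarray, windows):
    """Sketch of Algorithm 2 / Eq. 8 using an integral image.

    belt_p, belt_q: SSBelts with bin values in {1, ..., 8}.
    windows: list of (col_start, col_end) full-height sub-windows of belt_q.
    Returns one expected-likelihood-kernel similarity per window.
    """
    # Normalized 8-bin histogram of the smaller belt.
    h_p = np.bincount(belt_p.ravel(), minlength=9)[1:9].astype(np.float64)
    h_p /= h_p.sum()

    # Replace every pixel of belt_q by h_p(b(x)), then integrate.
    lut = np.concatenate(([0.0], h_p))              # index 0 unused
    mapped = lut[belt_q]
    integral = mapped.cumsum(axis=0).cumsum(axis=1)
    integral = np.pad(integral, ((1, 0), (1, 0)))   # pad for easy box sums

    d = belt_q.shape[0]
    sims = []
    for c0, c1 in windows:
        box = (integral[d, c1] - integral[d, c0]
               - integral[0, c1] + integral[0, c0])
        sims.append(box / (d * (c1 - c0)))          # average of h_p(b(x)) over W
    return sims
```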

5.2 Refined localization by IMark

Histogram-based coarse verification enables fast filtering of candidates. However, using only global features may not generate desirable near-duplicate localization results. In this paper, we encode the intensity information of the frames into a vector, thereby achieving refined alignment.

Motivated by the CE/LBP-based features of [25], we aim to use an 8-bit ordinal-intensity-related feature for frame representation. The ordinal intensities are already calculated in the SSBelt-building stage and can therefore be obtained simultaneously. As mentioned above, the CE/LBP-based features are sensitive to rotation and horizontal/vertical flip. In view of this, we propose the Intensity Mark (IMark) as a descriptor robust to horizontal/vertical flip.

For the 3 × 3 ordinal intensity blocks, we exclude block G[2,2] at the center and concatenate the remaining blocks into a ring. We define the ring's "principal direction" by finding the largest-intensity block and comparing the two blocks on its sides. Figure 12 shows an example. Block G[1,2] is found to be the largest-intensity block in the ring. Then, by investigating G[1,2]'s side-block intensities, we find that G[1,3] is relatively larger, and the principal direction of the ring is therefore defined as ⟨G[1,2] → G[1,3] → G[2,3] → G[3,3] → G[3,2] → G[3,1] → G[2,1] → G[1,1]⟩. For the first and last members of the ring, we compare them with the center block G[2,2]; for the other ring members, we compare them with their successors. Finally, the 8-bit binary IMark signature is 11110001. As shown in Fig. 12, the introduction of the "principal direction" makes IMark robust when coping with horizontal/vertical flip.
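A minimal sketch of the IMark computation is given below, assuming the 3 × 3 block-average intensities are available from the ordinal-signature stage. The bit convention (1 when the current element exceeds the one it is compared with) is an assumption chosen to match the worked example only in structure, not bit-for-bit.

```python
import numpy as np

def imark_signature(blocks3x3: np.ndarray) -> int:
    """Sketch of the IMark descriptor (Sect. 5.2).

    blocks3x3: 3x3 array of average block intensities G[1,1]..G[3,3].
    """
    center = blocks3x3[1, 1]
    # Ring around the center, clockwise starting at G[1,1].
    ring_idx = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    ring = [blocks3x3[r, c] for r, c in ring_idx]

    start = int(np.argmax(ring))                 # largest-intensity block
    prev_n, next_n = ring[(start - 1) % 8], ring[(start + 1) % 8]
    step = 1 if next_n >= prev_n else -1         # principal direction

    ordered = [ring[(start + step * k) % 8] for k in range(8)]

    bits = [1 if ordered[0] > center else 0]              # first vs. center
    for k in range(1, 7):
        bits.append(1 if ordered[k] > ordered[k + 1] else 0)  # vs. successor
    bits.append(1 if ordered[7] > center else 0)           # last vs. center
    return int("".join(map(str, bits)), 2)
```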

Given an n-frame video clip p, the d × (n − d) SSBelt and the 1 × n IMark signature sequence are generated. Noticing that the SSBelt and IMark represent the clip content from different viewpoints (self-similarity and visual appearance), IMark is a good supplement to SSBelt-based near-duplicate video retrieval. After coarse verification, the IMark similarity calculation between p and a sub-window W further refines the localization result. The IMark similarity is defined in Eq. 9:

\mathrm{Sim}(p, W) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{NumZeros}\big[\mathrm{IMark}_p(i) \oplus \mathrm{IMark}_W(i)\big]
\qquad (9)
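Under the same assumptions, Eq. 9 reduces to counting equal bits between aligned IMark signatures, for example:

```python
import numpy as np

def imark_similarity(marks_p, marks_w) -> float:
    """Eq. 9: mean number of equal bits between aligned 8-bit IMark signatures."""
    p = np.asarray(marks_p, dtype=np.uint8)
    w = np.asarray(marks_w, dtype=np.uint8)
    xored = np.bitwise_xor(p, w)[:, None]                  # shape (n, 1)
    num_zeros = 8 - np.unpackbits(xored, axis=1).sum(axis=1)
    return float(num_zeros.mean())
```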

Fig. 12 IMark descriptor


6 Experiments

In this section, we evaluate the proposed approach on near-duplicate video databases [21,23,29,40]. Experiment 1 aims to test the general effectiveness and efficiency of the SSBelt-based method. In Experiment 2, we focus on the near-duplicate alignment task. Experiment 3 is designed to evaluate the performance on unbalanced near-duplicate videos.

6.1 Experiment 1

6.1.1 Dataset description

Experiment 1 is based on the public CC_WEB_VIDEO dataset [23]. It contains 24 queries and the returned videos downloaded from YouTube, Google Video, and Yahoo! Video. In total, there are 12,790 videos, with 27 % of them being labeled as near-duplicate videos.

6.1.2 Method setup

SSBelt-based methods: First, the videos are sampled at 10 fps. By extracting the CE/LBP-based ordinal signatures, two kinds of SSBelts (SSBelt_CE and SSBelt_LBP) are built. We then detect the Interest Corners and quantize the 64-dimensional descriptors into visual words. The visual word vocabulary contains 34 items. Next, neighboring visual words are combined into visual shingles and inserted into the inverted file index. Finally, the retrieval result is returned by querying the index. Because the near-duplicate alignment task is not required in this experiment, we eliminate the post-verification step.

Comparisons: We compare the proposed self-similarity-based method with visual-appearance-based methods reported in [10,25,35,36]. These methods cover a range of global and local feature extraction on frames. We briefly describe them as follows:

• Spatiotemporal features (STF_CE/STF_LBP) [25]: The input video is sampled at 1 fps and the CE/LBP signatures are extracted. Following the generation scheme in [25], a bag-of-visual-shingle (BoVS) representation is built for neighboring signatures. The retrieval task is implemented via fast histogram intersection similarity measurements and an inverted file index.
• Color Histogram (SIG_CH) [36]: Keyframes are first detected for the input video. Then the keyframes are represented by a 24-bin color histogram in HSV color space.
• Hierarchical method (HIRACH) [36]: The color histogram is employed initially as a fast filter for very dissimilar videos. Then a more accurate but expensive local-feature-based pairwise comparison among keyframes is adopted. The global signatures and pairwise measures are combined into the final results.
• Ordinal Measure (OM) [26]: The videos are sampled at 10 fps. For each frame, 3 × 3 blocks and the corresponding ordinal vector are constructed. The video signature is then matched via a coarse-to-fine approach that uses Sequence Shape Similarity (SSS) and Real Sequence Similarity (RSS).
• Multiple Feature Hashing (MFH) [10]: A shot-based sampling method is used to select keyframes. Then, the LBP feature is extracted as a local feature and combined with the global HSV feature. The hash codes and functions are learned from the training video data.
• Adaptive Structure Video Tensor (ASVT) [35]: DoG-based local points are detected in the keyframes and described by the PCA-SIFT descriptor. Then, PDF estimation is employed to build the structure tensor. After dimensionality reduction, an R-tree is constructed to enable efficient kNN search. DTS(S, Q) is chosen as the distance measurement for the ASVT series.

6.1.3 Results and discussion

We evaluated the effectiveness of near-duplicate video retrieval on the CC_WEB_VIDEO dataset by using Precision–Recall curves. Figure 13 illustrates the results for the compared methods. Table 1 gives a comparison with respect to precision, time, and storage costs. In Table 1, the MAP column refers to the Mean Average Precision over the 24 queries, the Time column is the average search time over the 24 queries, and the Storage column gives the total memory cost to retrieve one query ('–': not reported in the original paper).

Fig. 13 Results on CC_WEB_VIDEO dataset


Table 1 MAP, time, and storage costs on CC_WEB_VIDEO

Methods          MAP      Time (s)       Storage (MB)

STF_CE [25]      0.950    3.6 × 10^−3    4.88
STF_LBP [25]     0.953    3.7 × 10^−3    5.18
SIG_CH [36]      0.892    –              –
HIRACH [36]      0.952    9.6            –
OM [26]          0.910    2.9            –
MFH [10]         0.954    –              –
ASVT [35]        0.956    –              –
SSBelt_CE        0.920    4.6 × 10^−3    0.37
SSBelt_LBP       0.922    4.6 × 10^−3    0.38

The bold values denote the best performance among all the compared methods

6.2 Experiment 2

6.2.1 Dataset and tasks

Experiment 2 is based on the public MUSCLE-VCD-2007 corpus [21], which was used as the official dataset for the CIVR'07 Copy Detection Evaluation. It contains about 100 h of video material sourced from the web, TV archives, and movies. There are two retrieval tasks (ST1, ST2) aimed at different applications:

ST1: The query videos are copies of complete videos (from 5 min to 1 h) in the source database. The queries can be re-encoded, noise-affected, or slightly retouched versions of the original source. Each query has at most one corresponding entry in the source database. Overall, there are 15 queries for ST1.

Metric: Given an input query, one answer is returned, which indicates the corresponding source file in the database or that the query is not a copy. The final criterion is the number of correct answers divided by the total number of queries:

Quality = Num_correct / Num_total

ST2: Several segments have been selected randomly from the database and post-processed by professional video-editing software. Queries include parts of several videos belonging (or not belonging) to the database. Segments belonging to the database must be identified and localized by their start/end-time codes. Overall, there are 3 queries and 21 segments for ST2.

Metric 1 (considering matched segments): This is computed from the percentage of mismatched video segments over all queries:

Quality1 = (Num_correct − FalseAlarm) / Num_Segments

Metric 2 (considering matched frames): This is computed from the percentage of mismatched frames over all queries:

Quality2 = 1 − Num_miss / Num_Frames

"Considering matched frames" in ST2 is equivalent to the task of near-duplicate alignment, which requires the post-verification step in our approach. Intuitively, it places higher demands on the retrieval system.

6.2.2 Results and discussion

In Experiment 2, the input videos are first segmented into scenes. Scenes shorter than 10 s are removed, and scenes longer than 10 min are arbitrarily segmented. The task of ST1 is then formulated as a voting problem, where we find the best-matched correspondent for each scene in the query and let the scenes vote for the final matched video. Table 2 shows the comparison of different results for MUSCLE-VCD-2007 ST1 (ADV, IBM, CITYU, and CAS denote the participating teams in the CIVR'07 evaluation).

According to Table 2, both the SSBelt_CE and SSBelt_LBP methods achieved satisfying results in ST1. The results for ST2 are shown in Table 3. The task "considering frames" evaluates the performance of near-duplicate alignment. In our implementation, the start/end-time codes for the input query are located via the post-verification step involving IMark. The inherent assumption of IMark is that the extracted video segments are "ideally consecutive", with no temporal editing such as inserting or deleting frames. In [26], Hua et al. proposed a dynamic-programming-based sequence method that is robust to such temporal editing.

Table 2 Comparisons on MUSCLE-VCD-2007 ST1

Methods          Quality

ADV              86 %
IBM              86 %
CITYU            66 %
CAS              53 %
GOS [9]          83 %
SSM [17]         100 %
Yeh [37]         93 %
Poullot [38]     93 %
Zheng [39]       100 %
SSBelt_CE        100 %
SSBelt_LBP       100 %

The bold values denote the best performance among all the compared methods


Table 3 Comparisons on MUSCLE-VCD-2007 ST2

Considering segments

CITYU               86 %
ADV                 33 %
Yeh [37]            86 %
Poullot [38]        86 %
Zheng [39]          90 %
SSBelt_CE           95 %
SSBelt_LBP          95 %

Considering frames

CITYU               76 %
ADV                 17 %
Zheng [39]          85 %
SSBelt_CE+IMark     90 %
SSBelt_LBP+IMark    90 %

The bold values denote the best performance among all the compared methods

Fig. 14 Failed example and its SSBelt

However, in ST2 the segments do not include transformations such as frame insertion or deletion. In this case, we can use Eq. 9 directly to find the best-matched time codes. Future work may incorporate the algorithm in [26] to achieve robust solutions under temporal editing conditions.

As shown in Table 3 ("Considering segments"), among the 21 video segments, 20 were successfully matched, with one failure (no returned answer). The failed segment and its SSBelt are shown in Fig. 14. It records a person's monologue, with all the frames in the 103-s segment being very similar to each other. Because only one Interest Corner was detected in the SSBelt, we could not combine the visual word into a visual shingle and retrieve via the index. This video segment

is representative of scenarios in which self-similarity-based methods lose their discriminative power.

With respect to the near-duplicate alignment task, the proposed IMark-based method returned good results. The average quality considering frames is 90 %. Apart from the failed video segment, the average localization quality for the remaining 20 segments is more than 95 %, which significantly outperforms the other approaches.

6.3 Experiment 3

In this experiment, we simulate a real unbalanced near-duplicate video retrieval environment by collecting inaccurately segmented video segments. The dataset is built with 300 source videos from MUSCLE-VCD-2007 [21], MPEG [29], TRECVID'08 [40], and TV archives. Following the setup used in [20], we apply two or more transformations randomly selected from ⟨Adding Noise, Blur, Caption Insertion, Changing Ratio, Contrast Change, Crop, Flip, Picture in Picture, Resolution Reduction, Analog Recording⟩ to obtain 300 multi-transformed videos. We then segment the source videos and the corresponding transformed videos via scene detection. A set of 500 inaccurately segmented clip pairs ranging from one to six minutes is extracted and formed into the core dataset. By adding another 500 video clips to the source side, the final dataset includes 500 queries and 1,000 source clips, with each query corresponding to exactly one source video clip.

We use the metric Quality to evaluate the effectiveness of the approaches. The compared methods are:

– Ordinal Measure [26]
– Seam Image [19]
– Spatiotemporal CE/LBP features [25]
– SSBelt_LBP and SSBelt_LBP + IMark

The sample rate for OM, SEAM, SSM, and our approach is 10 fps; for STF_CE/STF_LBP, it is 1 fps. Table 4 gives the average matching quality (considering segments, AVG Quality), the average matching time (AVG Time), and the average alignment quality (Alignment).

Table 4 Results for experiment 3

Methods              AVG quality    AVG time (s)    Alignment

OM [26]              73.5 %         20.3            60.0 %
SEAM [19]            31.0 %         6.3 × 10^−3     No
STF_CE [25]          63.0 %         4.3 × 10^−3     No
STF_LBP [25]         68.5 %         4.3 × 10^−3     No
SSBelt_LBP           78.0 %         3.0 × 10^−2     No
SSBelt_LBP+IMark     88.5 %         6.2             84.0 %

The bold values denote the best performance among all the compared methods


In Table 4, "No" means the method has no near-duplicate localization ability for the alignment task. According to Table 4, benefiting from robustness to transformations such as Flip, our self-similarity-based methods achieved better matching results than the other ordinal-measure-based methods (OM, STF_CE/STF_LBP). As discussed in Sect. 2, the SEAM method suffers from inconsistent seam positions detected in unbalanced video content. For the task of near-duplicate alignment, IMark provides an effective solution for detecting the start/end-time codes. IMark is particularly robust to transformations such as Adding Noise, Blur, Contrast Change, Changing Ratio, Flip, and Resolution Reduction. The coarse verification step that uses fast histogram matching greatly reduces the number of candidate positions for IMark localization. However, because of the small number of histogram bins (only an 8-bin histogram is generated from the SSBelt), the coarse verification step still leaves room for further improvement. The average matching time, including IMark, is 6.2 s, which falls far short of expectations. We will seek a fast and effective solution for near-duplicate alignment in future work.

7 Conclusion

We have focused on the "unbalance" problem and the consequent near-duplicate alignment task for partial near-duplicate video retrieval. "Unbalance" refers to scenarios in which near-duplicate video clips are not correctly aligned. In the proposed SSBelt approach, we extract frame-level binary signatures and encode them into a succinct matrix representation by mining self-similarity patterns. To implement efficient retrieval, Interest Corners are detected and inserted into an inverted file index. In addition, we propose IMark for near-duplicate alignment. In the experiments, the general capacity for coping with web videos and unbalanced videos, as well as the localization effectiveness, were investigated and analyzed.

Our future work focuses on building a fast near-duplicate retrieval system for web-scale data. In the current coarse-to-fine retrieval mechanism, the SSBelt-based coarse step aims to detect near-duplicate videos and provide a set of candidates for refined retrieval and localization. Intuitively, we can improve the speed on both the coarse side and the refined side. Rather than building an inverted index for the visual words and visual shingles in the SSBelt, one promising direction is to employ approximate similarity search techniques such as locality-sensitive hashing (LSH) [6]. Alternatively, without extracting local Interest Corners, we can build global fingerprints for SSBelts. These fingerprints can be efficiently indexed by using a balanced binary (multi-way)

search tree. That is, we generate a (random) projection direction for each tree node and map feature vectors into a lower-dimensional space in order to dispatch them into different child nodes. This approach provides fast and stable retrieval time. The complexity is O(h), where h is the height of the search tree.

With respect to the refined search step, IMark proved to be the bottleneck. We can simply lower the sample rate to improve the IMark matching speed, but this will have side effects on retrieval accuracy. Another feasible way is to simplify IMark itself: instead of computing IMark for each sampled frame, we can extract only the pattern changes between adjacent descriptors and compress the IMark vector.

References

1. Wu X, Ngo CW, Hauptmann AG, Tan HK (2009) Real-time near-duplicate elimination for web video search with content and context. IEEE Trans Multimed 11(2):196–207
2. Cherubini M, de Oliveira R, Oliver N (2009) Understanding near-duplicate videos: a user-centric approach. In: ACM international conference on multimedia, pp 35–44
3. Tan HK, Ngo CW, Chua TS (2010) Efficient mining of multiple partial near-duplicate alignments by temporal network. IEEE Trans Circuits Syst Video Technol 20(11):1486–1498
4. Zhang DQ, Chang SF (2004) Detecting image near-duplicate by stochastic attributed relational graph matching with learning. In: ACM international conference on multimedia, pp 877–884
5. Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60(2):91–110
6. Ke Y, Sukthankar R, Huston L (2004) Efficient near-duplicate detection and sub-image retrieval. In: ACM international conference on multimedia, pp 869–876
7. Zhao WL, Ngo CW, Tan HK, Wu X (2007) Near-duplicate keyframe identification with interest point matching and pattern learning. IEEE Trans Multimed 9(5):1037–1048
8. Zhu J, Hoi SC, Lyu MR, Yan S (2011) Near-duplicate keyframe retrieval by semi-supervised learning and nonrigid image matching. ACM Trans Multimed Comput Commun Appl 7(1):4
9. Liu H, Lu H, Wen Z, Xue X (2012) Gradient ordinal signature and fixed-point embedding for efficient near-duplicate video detection. IEEE Trans Circuits Syst Video Technol 22(4):555–566
10. Song J, Yang Y, Huang Z, Shen HT, Hong R (2011) Multiple feature hashing for real-time large scale near-duplicate video retrieval. In: ACM international conference on multimedia, pp 423–432
11. Chum O, Philbin J, Isard M, Zisserman A (2007) Scalable near identical image and shot detection. In: ACM international conference on image and video retrieval, pp 549–556
12. Xu D, Cham TJ, Yan S, Duan L, Chang SF (2010) Near duplicate identification with spatially aligned pyramid matching. IEEE Trans Circuits Syst Video Technol 20(8):1068–1079
13. Zheng YT, Neo SY, Chua TS, Tian Q (2007) The use of temporal, semantic and visual partitioning model for efficient near-duplicate keyframe detection in large scale news corpus. In: ACM international conference on image and video retrieval, pp 409–416
14. Wei S, Zhao Y, Zhu C, Xu C, Zhu Z (2011) Frame fusion for video copy detection. IEEE Trans Circuits Syst Video Technol 21(1):15–28
15. Min H-S, Choi JY, De Neve W, Ro YM (2012) Near-duplicate video clip detection using model-free semantic concept detection and adaptive semantic distance measurement. IEEE Trans Circuits Syst Video Technol 22(8):1174–1187


16. Zhou X, Chen L (2010) Monitoring near duplicates over video streams. In: ACM international conference on multimedia, pp 521–530
17. Wu Z, Huang Q, Jiang S (2009) Robust copy detection by mining temporal self-similarities. In: IEEE international conference on multimedia and expo, pp 554–557
18. Cui P, Wu Z, Jiang S, Huang Q (2010) Fast copy detection based on Slice Entropy Scattergraph. In: IEEE international conference on multimedia and expo, pp 1236–1241
19. Zhang X, Hua G, Zhang L, Shum H (2010) Interest seam image. In: IEEE conference on computer vision and pattern recognition, pp 3296–3303
20. Wu Z, Jiang S, Huang Q (2009) Near-duplicate video matching with transformation recognition. In: ACM international conference on multimedia, pp 549–552
21. MUSCLE-VCD-2007: benchmark for video copy detection. https://www.rocq.inria.fr/imedia/civr-bench/
22. Chang HW, Chen HT (2010) A square-root sampling approach to fast histogram-based search. In: IEEE conference on computer vision and pattern recognition, pp 3043–3049
23. CC_WEB_VIDEO: Near-duplicate web video dataset. http://vireo.cs.cityu.edu.hk/webvideo/
24. Hampapur A, Hyun K, Bolle R (2002) Comparison of sequence matching techniques for video copy detection. In: SPIE storage and retrieval for media databases, pp 194–201
25. Shang L, Yang L, Wang F, Chan KP, Hua XS (2010) Real-time large scale near-duplicate web video retrieval. In: ACM international conference on multimedia, pp 531–540
26. Hua X, Chen X, Zhang H (2005) Robust video signature based on ordinal measure. In: IEEE international conference on image processing, pp 685–688
27. Foote J, Uchihashi S (2001) The beat spectrum: a new approach to rhythm analysis. In: IEEE international conference on multimedia and expo, pp 224–227
28. Junejo IN, Dexter E, Laptev I, Pérez P (2010) View-independent action recognition from temporal self-similarities. IEEE Trans Pattern Anal Mach Intell 33(1):172–185
29. Bober M, Bober SK (2002) Description of MPEG-7 visual core experiments. ISO/IEC JTC1/SC29/WG11 N5166
30. Heikkilä M, Pietikäinen M, Schmid C (2009) Description of interest regions with local binary patterns. Pattern Recognit 42(3):425–436
31. Sivic J, Zisserman A (2003) Video Google: a text retrieval approach to object matching in videos. In: IEEE international conference on computer vision, pp 1470–1477
32. Nister D, Stewenius H (2006) Scalable recognition with a vocabulary tree. In: IEEE conference on computer vision and pattern recognition, pp 2161–2168
33. Broder AZ (1997) On the resemblance and containment of documents. In: Compression and complexity of sequences, pp 21–29
34. Viola P, Jones M (2004) Robust real-time face detection. Int J Comput Vis 57(2):137–154
35. Zhou X, Chen L, Zhou X (2012) Structure tensor series-based large scale near-duplicate video retrieval. IEEE Trans Multimed 14(4):1220–1233
36. Wu X, Hauptmann AG, Ngo CW (2007) Practical elimination of near-duplicates from web video search. In: ACM international conference on multimedia, pp 218–227
37. Yeh M-C, Cheng K-T (2009) A compact, effective descriptor for video copy detection. In: ACM international conference on multimedia, pp 633–636
38. Poullot S, Crucianu M, Buisson O (2008) Scalable mining of large video databases using copy detection. In: ACM international conference on multimedia, pp 61–70
39. Zheng L, Qiu G, Huang J, Fu H (2011) Salient covariance for near-duplicate image and video detection. In: IEEE international conference on image processing, pp 2537–2540
40. TRECVID2008: TREC Video Retrieval Evaluation. http://www-nlpir.nist.gov/projects/trecvid
