Near-Duplicate Video Retrieval by Aggregating Intermediate CNN Layers

Giorgos Kordopatis-Zilos1,2, Symeon Papadopoulos1, Ioannis Patras2 and Yiannis Kompatsiaris1

1Information Technologies Institute, CERTH, Thessaloniki, Greece
2Queen Mary University of London, Mile End Campus, UK, E1 4NS

23rd International Conference on MultiMedia Modeling, Reykjavík, Iceland, 4-6 January 2017

Problem & Motivation

• Near-Duplicate Video Retrieval (NDVR)

• Given a query video, search a video dataset to retrieve (visually) highly similar videos

• Rank the candidate videos based on their similarity to the query

• Various applications
  • content verification
  • video retrieval, management and recommendation
  • copyright protection

• NDVR is of crucial importance due to the exponential growth of video content

Near-Duplicate Videos: Definition

• Variety of definitions and understandings regarding near-duplicate videos

• Adopt the definition by Wu et al. (2007)
  • photometric variations: gamma, contrast, brightness, etc.
  • editing operations: resize, shift, crop, flip
  • insertion of patterns: caption, logo, subtitles, sliding captions, etc.
  • re-encoding: video format, compression
  • video modifications: frame rate, frame insertion, deletion, swap

X. Wu, A. G. Hauptmann, and C. W. Ngo. Practical elimination of near-duplicates from web video search. In Proceedings of the 15th ACM international conference on Multimedia, pp. 218-227, 2007

Related Work

• Variety of approaches (Liu et al., 2013)

• Video-level matching: comparison of global signatures
  • Global feature vectors
  • Fingerprints
  • Hash codes

• Frame-level matching: frames or sequences
  • Local descriptors
  • Spatiotemporal features

• Hybrid-level matching
  • Filter-and-refine methods

• TRECVID content-based copy detection (Kraaij & Awad, 2011)
  • duplicates artificially generated by standard transformations

W. Kraaij and G. Awad. TRECVID 2011 content-based copy detection: Task overview. Proc. TRECVid 2010, 2011

J. Liu, Z. Huang, H. Cai, H. T. Shen, C. W. Ngo, and W. Wang. Near-duplicate video retrieval: Current research and future trends. ACM Computing Surveys, vol. 45, no. 4, 44, 2013

Feature Extraction (1/2)

• Employ a pre-trained CNN with convolutional layers

• Apply max pooling on every channel of the feature map of each layer (Zheng et al., 2016)

• One vector is generated per layer, with dimensionality equal to the number of channels of that layer

L. Zheng, Y. Zhao, S. Wang, J. Wang, and Q. Tian. Good Practice in CNN Feature Transfer. arXiv:1604.00133, 2016
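The pooling step can be illustrated with a minimal NumPy sketch. Random arrays stand in for the CNN feature maps here; the (channels, height, width) layout, the function names and the toy shapes are assumptions made for illustration, not the Caffe pipeline used in the slides.

```python
import numpy as np


def channel_max_pool(feature_map):
    """Reduce a (channels, height, width) feature map to one value per channel."""
    feature_map = np.asarray(feature_map, dtype=float)
    # Flatten the spatial dimensions, then keep the maximum activation per channel.
    return feature_map.reshape(feature_map.shape[0], -1).max(axis=1)


def extract_layer_vectors(feature_maps):
    """Apply channel-wise max pooling to the feature map of every layer."""
    return [channel_max_pool(fm) for fm in feature_maps]


# Toy stand-ins for the activations of two convolutional layers (hypothetical shapes).
rng = np.random.default_rng(0)
toy_feature_maps = [rng.random((96, 55, 55)), rng.random((256, 27, 27))]

layer_vectors = extract_layer_vectors(toy_feature_maps)
print([v.shape for v in layer_vectors])  # one vector per layer, one entry per channel
```

Each layer contributes a vector whose length equals its number of channels, regardless of the spatial size of its feature map.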

Feature Extraction (2/2)

• Pre-trained CNN models from Caffe (Jia et al., 2014): a) AlexNet, b) VGGNet, c) GoogLeNet

• Feature extraction uses the convolutional layers of the architectures

Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM int. conference on Multimedia, pp. 675-678, 2014

[Figure panels: AlexNet, VGGNet, GoogLeNet]

Vector Aggregation


Layer Aggregation

Video Indexing and Querying

• tf-idf weighting of visual words

• Inverted file indexing structure for fast search

• Retrieve candidates with at least one common visual word

• Rank candidates based on cosine similarity of their tf-idf representations
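The indexing and querying steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions: each video is given as a list of visual-word ids (one per keyframe), idf is taken as log(N / df), and all function names and the toy data are hypothetical rather than taken from the authors' implementation.

```python
import math
from collections import Counter, defaultdict


def tfidf_vector(words, idf):
    """Sparse tf-idf vector (dict) of a video given its visual-word ids."""
    tf = Counter(words)
    return {w: count * idf.get(w, 0.0) for w, count in tf.items()}


def build_index(video_words):
    """Compute idf weights, tf-idf vectors and an inverted file for a dataset."""
    n_videos = len(video_words)
    df = Counter()
    inverted = defaultdict(set)
    for vid, words in video_words.items():
        for w in set(words):
            df[w] += 1
            inverted[w].add(vid)  # visual word -> videos containing it
    idf = {w: math.log(n_videos / df[w]) for w in df}
    vectors = {vid: tfidf_vector(words, idf) for vid, words in video_words.items()}
    return idf, vectors, inverted


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * b.get(w, 0.0) for w, x in a.items())
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query_words, idf, vectors, inverted):
    """Retrieve videos sharing at least one visual word, ranked by cosine similarity."""
    q = tfidf_vector(query_words, idf)
    candidates = set()
    for w in set(query_words):
        candidates |= inverted.get(w, set())
    ranked = sorted(((cosine(q, vectors[v]), v) for v in candidates), reverse=True)
    return [(v, score) for score, v in ranked]


# Toy dataset: three videos described by the visual words of their keyframes.
dataset = {"a": [1, 1, 2], "b": [1, 3], "c": [4, 4]}
idf, vectors, inverted = build_index(dataset)
results = search([1, 1, 2], idf, vectors, inverted)
print(results)
```

In the toy run, video "c" shares no visual word with the query, so the inverted file never surfaces it as a candidate and no similarity is computed for it.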

Evaluation: Dataset

• Dataset: CC_WEB_VIDEO
  • Videos: 13,139
  • Keyframes: 397,965 images

CC_WEB_VIDEO: http://vireo.cs.cityu.edu.hk/webvideo/

Dataset Annotation

• Evaluation metrics
  • precision-recall (PR)
  • mean Average Precision (mAP)
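For reference, mAP can be computed as in the sketch below. It assumes the common definition of average precision (the mean of the precision values at the ranks where relevant videos appear, divided by the number of relevant videos); the slides do not spell out the formula, and the function names and toy rankings are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked result list against its relevant set."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, vid in enumerate(ranked_ids, start=1):
        if vid in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of the average precision over (ranked_ids, relevant_ids) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Two toy queries with hand-made rankings and ground truth.
runs = [
    (["a", "x", "b"], {"a", "b"}),  # relevant videos at ranks 1 and 3
    (["x", "a"], {"a"}),            # relevant video at rank 2
]
print(mean_average_precision(runs))
```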


Dataset Examples

[Figure: query videos and their near-duplicate videos]

Results I: Impact of CNN architecture and vocabulary size

Results II: Performance using individual layers

[Figure panels: AlexNet, VGGNet, GoogLeNet]

Results III

• Performance per query

• Best runs
  • CNN-V: vector-based aggregation with GoogLeNet
  • CNN-L: layer-based aggregation with VGGNet

• Lower precision in hard queries
  • query 18 (Bus uncle)
  • query 22 (Numa Gary)

Evaluation: Comparison to SoA

• Color Histograms (CH) (Wu et al., 2007): video-level matching, color histograms

• Auto Color Correlograms (ACC) (Cai et al., 2011): frame-level matching, auto-color correlograms, BoW, tf-idf weighted cosine similarity

• Local Structure (LS) (Wu et al., 2007): hybrid-level matching, color histograms, keyframe similarity of PCA-SIFT descriptors

• Multiple Feature Hashing (MFH) (Song et al., 2013): video-level matching, hashing of multiple features into Hamming space, combination of the keyframe hash codes into a global video representation

• Pattern-based approach (PPT) (Chou et al., 2015): hybrid-level matching, pattern-based indexing tree (PI-tree), m-pattern-based dynamic programming (mPDP), time-shift m-pattern similarity (TPS)

X. Wu, A. G. Hauptmann, and C. W. Ngo. Practical elimination of near-duplicates from web video search. In Proceedings of the 15th ACM international conference on Multimedia, pp. 218-227, 2007

Y. Cai, L. Yang, W. Ping, F. Wang, T. Mei, X. S. Hua, and S. Li. Million-scale near-duplicate video retrieval system. In Proceedings of the 19th ACM international conference on Multimedia, pp. 837-838, 2011

J. Song, Y. Yang, Z. Huang, H. T. Shen, and J. Luo. Effective multiple feature hashing for large-scale near-duplicate video retrieval. IEEE Transactions on Multimedia, vol. 15, no. 8, pp. 1997-2008, 2013

C. L. Chou, H. T. Chen, and S. Y. Lee. Pattern-Based Near-Duplicate Video Retrieval and Localization on Web-Scale Videos. IEEE Transactions on Multimedia, vol. 17, no. 3, pp. 382-395, 2015

Results IV: Comparison against existing NDVR approaches

Future Work

• Exploit the C3D features (Tran et al., 2015)

• Conduct more comprehensive evaluations
  • More challenging datasets: larger scale, more similar but non-relevant videos (distractors)

• Partial Duplicate Video Retrieval (PDVR)
  • Assess the applicability of the approach to the PDVR problem

D. Tran, L. Bourdev, R. Fergus, L. Torresani and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497, 2015

Thank you!

Get in touch:
• George Kordopatis-Zilos: [email protected]
• Symeon Papadopoulos: [email protected] / @sympap

