Intelligent Thumbnail Selection

Kamil Sindi, Lead Data Scientist

JW Player

1. Company

a. Open-source video player

b. Hosting platform

c. 5% of global internet video traffic

d. 150+ team

2. Data Team

a. Handling 5MM events per minute

b. Storing 1TB+ per day

c. Stack: Storm (Trident), Kafka, Luigi, Elasticsearch, Spark, AWS, MySQL

Customers

Thumbnails are Important

● Your video's first impression

● Types: Upload, Manual, Auto (default)

● Manual >> Auto in Play Rate

● Current Auto: the frame at the 10-second mark

● Many big publishers only use Manual

● 90% of Thumbnails are Auto! :-(

Source: tastingtable.com (2016-10-12)

What’s a “Good” Thumbnail?

It’s subjective to the viewer!

Common themes:

● Not blurry

● Balanced brightness

● Centered objects

● Large text overlay

● Relevant to subject

Figure: two example thumbnails compared side by side (Source: Big Buck Bunny, Blender Studios)
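Two of these themes, “not blurry” and “balanced brightness”, can be approximated with classic hand-crafted measures. The sketch below is a minimal illustration, not the method used in the talk. It assumes a grayscale frame given as a 2D list of 0-255 values. Brightness is measured as mean intensity, and sharpness as the variance of a 4-neighbour Laplacian response, where low variance suggests blur. Any cut-offs for “too dark” or “too blurry” would need tuning.

```python
from statistics import mean, pvariance


def brightness(gray):
    """Mean intensity (0-255) of a grayscale frame given as a 2D list."""
    return mean(v for row in gray for v in row)


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over interior pixels.

    Sharp edges produce large positive and negative responses; a blurry or
    flat frame produces responses near 0, so low variance suggests blur.
    """
    h, w = len(gray), len(gray[0])
    if h < 3 or w < 3:
        raise ValueError("frame must be at least 3x3")
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1] - 4 * gray[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    return pvariance(responses)


flat = [[128] * 5 for _ in range(5)]
checker = [[255 if (x + y) % 2 else 0 for x in range(5)] for y in range(5)]
print(brightness(flat), laplacian_variance(flat), laplacian_variance(checker))
```

Each hand-made measure like this needs its own threshold and weighting. That is exactly the difficulty the next slide raises.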

Manually Creating a Model is Hard

● Which features to extract?

● How to describe those features?

● How to weight features?

● How to penalize overfitting of models?

● Many techniques: SIFT, SURF, HOG?

Need to be an expert in Computer Vision :-(

So Many Image Features...

Figure panels: Edge Detection, Color Histogram, Pixel Segmentation

Deep Learning

● Learn features implicitly

● Learn from examples

● Techniques to avoid overfitting

● Successful in many applications:

○ Image classification

○ Image captioning

○ Machine translation

○ Speech-to-Text

Inception

● Run several convolution branches in parallel and concatenate their outputs (“modules”)

● Factoring convolutions (“towers”): e.g. 1x1 convs followed by 3x3

● Parameter reduction: GoogLeNet (~5MM) vs. AlexNet (~60MM), VGG-16 (~138MM)

● Auxiliary classifiers for regularization

● Residual connections (Inception-ResNet, introduced alongside Inception-v4)

● Depthwise separable convolutions (Xception)

https://www.udacity.com/course/deep-learning--ud730

https://arxiv.org/abs/1409.4842

Source: Rethinking the Inception Architecture for Computer Vision
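The parameter savings from factoring come straight from weight counts. A k_h x k_w convolution from C_in to C_out channels has k_h · k_w · C_in · C_out weights, ignoring biases. The sketch below works through that arithmetic. The channel count C is arbitrary, and the inception (3a) branch widths are GoogLeNet's published values.

```python
def conv_params(k_h, k_w, c_in, c_out):
    """Weight count of a k_h x k_w convolution from c_in to c_out channels (biases ignored)."""
    return k_h * k_w * c_in * c_out


def module_out_channels(branch_channels):
    """An Inception module concatenates its branches along the channel axis."""
    return sum(branch_channels)


C = 192  # arbitrary channel count; the ratios below do not depend on it

# Factor a 5x5 conv into two stacked 3x3 convs (same 5x5 receptive field).
five = conv_params(5, 5, C, C)
two_threes = 2 * conv_params(3, 3, C, C)
print(two_threes / five)  # 0.72, i.e. 28% fewer weights

# Factor a 3x3 conv into a 1x3 followed by a 3x1.
three = conv_params(3, 3, C, C)
asym = conv_params(1, 3, C, C) + conv_params(3, 1, C, C)
print(asym / three)  # about 0.667

# 1x1 "bottleneck": shrink 256 channels to 64 before the expensive 3x3.
plain = conv_params(3, 3, 256, 256)
bottleneck = conv_params(1, 1, 256, 64) + conv_params(3, 3, 64, 256)
print(plain, bottleneck)  # 589824 163840

# GoogLeNet's inception (3a): 1x1, 3x3, 5x5 and pool-projection branches.
print(module_out_channels([64, 128, 32, 32]))  # 256
```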






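1x1 Convolutions: what’s the point?

1. Dimensionality reduction: fewer channels, strides, feature pooling

2. Parameter reduction: faster, less overfitting

3. “Cheap” nonlinearity: a 1x1 conv (with its activation) in front of a 3x3 adds an extra nonlinearity at little cost

4. Cross-channel ⊥ spatial correlations: 1x1 convs mix channels separately from spatial filtering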

Figure panels: 1x1 convolution with strides; pooling with 1x1 convolution

Source: http://iamaaditya.github.io/2016/03/one-by-one-convolution/

In Convolutional Nets, there is no such thing as “fully-connected layers”. There are only convolution layers with 1x1 convolution kernels. – Yann LeCun


http://datascience.stackexchange.com/questions/12830/how-are-1x1-convolutions-the-same-as-a-fully-connected-layer
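LeCun’s point can be made concrete: a 1x1 convolution applies the same dense (fully-connected) layer to every pixel’s channel vector. The sketch below is dependency-free. It assumes a feature map stored as H x W x C_in nested lists and a C_out x C_in weight matrix. The names `dense` and `conv1x1` are illustrative, not a library API.

```python
def dense(vector, weights, bias):
    """Fully-connected layer: weights (C_out x C_in) @ vector + bias."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weights, bias)]


def conv1x1(feature_map, weights, bias):
    """1x1 convolution over an H x W x C_in map: the same dense layer at every pixel."""
    return [[dense(pixel, weights, bias) for pixel in row] for row in feature_map]


# 2 x 2 map with 4 channels, reduced to 2 channels (dimensionality reduction).
feature_map = [[[1, 2, 3, 4], [0, 1, 0, 1]],
               [[2, 2, 2, 2], [1, 0, 0, 0]]]
weights = [[1, 1, 0, 0],
           [0, 0, 1, 1]]
bias = [0, 1]
print(conv1x1(feature_map, weights, bias))
# [[[3, 8], [1, 2]], [[4, 5], [1, 1]]]

# On a 1 x 1 map (e.g. after global pooling) it is exactly a fully-connected layer.
v = [1, 2, 3, 4]
print(conv1x1([[v]], weights, bias)[0][0] == dense(v, weights, bias))  # True
```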



InceptionV3 Architecture

https://research.googleblog.com/2016/03/train-your-own-image-classifier-with.html

Example output: Dog (0.80), Cat (0.05), Rat (0.01), ...
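Scores like these are softmax probabilities over the classifier’s output classes. The sketch below turns raw logits into ranked probabilities. The labels and logits are made up for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    m = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(labels, logits, k=3):
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(zip(labels, softmax(logits)), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up logits for four classes.
labels = ["Cat", "Dog", "Rat", "Car"]
logits = [2.0, 4.0, 0.5, -1.0]
for label, p in top_k(labels, logits):
    print(f"{label} ({p:.2f})")
```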



Transfer Learning

ImageNet pre-training: 1,000,000 images, 1,000 categories

● Use a pre-trained model

○ Cheaper (often no GPU required)

○ Faster

○ Less prone to overfitting

● The penultimate (“bottleneck”) layer contains the image’s “essence” (CNN codes) and acts as a feature extractor

● Just add a linear classifier (softmax or linear SVM) on top of the bottleneck
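The following sketches “just add a linear classifier to the bottleneck” for the two-class Manual vs. Auto case, where softmax reduces to logistic regression. It assumes the bottleneck vectors were already extracted. InceptionV3’s penultimate layer yields 2048 values per image, but toy 2-D vectors stand in here. The weights are fitted with plain stochastic gradient descent. A real pipeline would likely use a library classifier; this only shows the mechanics.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, b, x):
    """P(label = 1 | x) for a linear model on a bottleneck vector x."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_logistic(features, labels, lr=0.1, epochs=200, l2=1e-4, seed=0):
    """Fit weights w and bias b by SGD on L2-regularized log loss."""
    rng = random.Random(seed)
    w = [0.0] * len(features[0])
    b = 0.0
    order = list(range(len(features)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = features[i], labels[i]
            g = predict_proba(w, b, x) - y  # gradient of log loss w.r.t. the logit
            w = [wj - lr * (g * xj + l2 * wj) for wj, xj in zip(w, x)]
            b -= lr * g
    return w, b


# Toy "bottleneck" vectors: 1 = Manual, 0 = Auto.
X = [[2.0, 1.0], [1.5, 2.0], [-1.0, -2.0], [-2.0, -1.0]]
y = [1, 1, 0, 0]
w, b = train_logistic(X, y)
print([round(predict_proba(w, b, x), 2) for x in X])
```

Only `w` and `b` are learned. The pre-trained network stays frozen, which is what makes this step cheap.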

Fine Tuning + Tips

● Replace the classification layer and backpropagate into earlier layers

● Idea: early layers learn basic filters; later layers are more dataset-specific

● Generally start from a pre-trained model, regardless of data size or similarity

Data Size (per class)   < 500       500 to 5,000            > 5,000
Similar to original     Too small   TL                      TL + FT earlier layers
Not similar             Too small   TL on earlier layers    TL + FT entire network

(TL = transfer learning on fixed pre-trained features; FT = fine-tuning)
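The rule-of-thumb table can be read as a small decision function. This is a sketch only. The slide leaves the boundary at exactly 500 examples open, and counting 500 in the middle column is an assumption.

```python
def transfer_strategy(examples_per_class, similar_to_original):
    """Map data size and similarity to the slide's suggested strategy.

    TL = transfer learning on fixed pre-trained features; FT = fine-tuning.
    Treating exactly 500 as the middle column is an assumption.
    """
    if examples_per_class < 500:
        return "Too small"
    if examples_per_class <= 5000:
        return "TL" if similar_to_original else "TL on earlier layers"
    return "TL + FT earlier layers" if similar_to_original else "TL + FT entire network"


print(transfer_strategy(10_000, similar_to_original=True))  # TL + FT earlier layers
```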

Other Applications of Transfer Learning

● Image captioning: Google “Show and Tell” (https://github.com/tensorflow/models/tree/master/im2txt)

● Image search

http://www.slideshare.net/ScottThompson90/applying-transfer-learning-in-tensorflow

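One common way to build image search on top of transfer learning is to reuse the bottleneck vectors as embeddings and rank images by cosine similarity to a query. The sketch below uses a hypothetical in-memory index of 3-D vectors. A real index would hold much longer CNN codes, and at scale would likely use an approximate nearest-neighbour library.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query, index, k=2):
    """Ids of the k indexed embeddings most similar to the query embedding."""
    ranked = sorted(index, key=lambda item_id: cosine(query, index[item_id]), reverse=True)
    return ranked[:k]


# Hypothetical index: image id -> embedding.
index = {"beach.jpg": [0.9, 0.1, 0.0], "surf.jpg": [0.8, 0.3, 0.1], "city.jpg": [0.0, 0.2, 0.9]}
print(search([1.0, 0.2, 0.0], index))  # ['beach.jpg', 'surf.jpg']
```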




Training: Thesis

Train a classifier to distinguish Manual from Auto thumbnails

● Manual thumbnails are (usually) better than Auto

● Positives: Manual thumbnails with high views and play rate; negatives: Auto thumbnails sampled at random from videos with low plays

● We have a lot of examples: 10K+ manual

● We used InceptionV3 pre-trained on ImageNet
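The sketch below shows one way the training set in this thesis might be assembled. The `Video` record, its field names and every threshold (`min_views`, `min_play_rate`, `max_auto_play_rate`) are hypothetical, since the talk does not give its cut-offs. Reading “low plays” as a low play rate, and balancing negatives against positives, are also assumptions.

```python
import random
from dataclasses import dataclass


@dataclass
class Video:
    video_id: str
    thumb_type: str   # "manual" or "auto"
    views: int
    play_rate: float  # plays / player impressions


def build_training_set(videos, min_views=1000, min_play_rate=0.3,
                       max_auto_play_rate=0.1, seed=0):
    """Return (video_id, label) pairs: 1 = Manual (positive), 0 = Auto (negative)."""
    positives = [v for v in videos
                 if v.thumb_type == "manual"
                 and v.views >= min_views and v.play_rate >= min_play_rate]
    low_play_auto = [v for v in videos
                     if v.thumb_type == "auto" and v.play_rate <= max_auto_play_rate]
    # Random sample of low-play Auto thumbnails, balanced against the positives.
    k = min(len(positives), len(low_play_auto))
    negatives = random.Random(seed).sample(low_play_auto, k)
    return [(v.video_id, 1) for v in positives] + [(v.video_id, 0) for v in negatives]


videos = [
    Video("m1", "manual", 5000, 0.5),
    Video("m2", "manual", 10, 0.9),    # too few views to trust
    Video("a1", "auto", 100, 0.05),
    Video("a2", "auto", 100, 0.5),     # Auto but plays well: excluded
    Video("a3", "auto", 200, 0.02),
]
print(build_training_set(videos))
```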

Training: Examples

Figure: Positive (Manual) vs. Negative (Auto) examples

Video Pre-Filter

Use FFmpeg to select the top 100 candidate frames

Methods:

● Color histogram changes, to avoid near-duplicate frames

● Coded macroblock information

● Remove “black” frames

● Measure motion vectors
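The talk does this step with FFmpeg, which also ships filters such as `blackframe` and `thumbnail`. The dependency-free sketch below illustrates two of the listed methods. It drops near-black frames and frames whose color histogram barely changed from the last kept frame, then keeps the `max_candidates` frames with the largest change. Frames are assumed to be already-decoded lists of (R, G, B) tuples. The thresholds are hypothetical, and macroblock and motion-vector signals are not modelled.

```python
import heapq


def color_histogram(pixels, bins=8):
    """Per-channel histogram of (R, G, B) pixels, each channel normalized to sum to 1."""
    hist = [0.0] * (3 * bins)
    for pixel in pixels:
        for channel, value in enumerate(pixel):
            hist[channel * bins + value * bins // 256] += 1
    return [count / len(pixels) for count in hist]


def is_black(pixels, threshold=16, fraction=0.98):
    """True if at least `fraction` of pixels are darker than `threshold` on every channel."""
    dark = sum(1 for pixel in pixels if max(pixel) < threshold)
    return dark / len(pixels) >= fraction


def histogram_distance(h1, h2):
    """L1 distance between histograms: 0 if identical, at most 6 (2 per channel)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def prefilter(frames, max_candidates=100, min_change=0.5):
    """Return indices of up to max_candidates frames, in original order."""
    candidates = []  # (histogram change vs. previously kept frame, frame index)
    last_hist = None
    for idx, pixels in enumerate(frames):
        if is_black(pixels):
            continue  # skip fades to black and similar frames
        hist = color_histogram(pixels)
        # The first kept frame gets the maximum possible change.
        change = 6.0 if last_hist is None else histogram_distance(hist, last_hist)
        if change < min_change:
            continue  # near-duplicate of the last kept frame
        candidates.append((change, idx))
        last_hist = hist
    best = heapq.nlargest(max_candidates, candidates)
    return sorted(idx for _, idx in best)


black = [(0, 0, 0)] * 10
red = [(255, 0, 0)] * 10
blue = [(0, 0, 255)] * 10
print(prefilter([black, red, red, blue]))  # [1, 3]
```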

Motion Vectors

Source: Sintel, Blender Studios

Engineering

Demo: Evaluation Tool

Demo: Examples

Original Auto (10th-second frame) vs. top-scored frames from the new model
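Once candidates are pre-filtered, picking thumbnails reduces to scoring each frame and keeping the best few. The sketch below assumes an already-trained linear (logistic) classifier on bottleneck features, as the Transfer Learning slide suggests. The features are keyed by timestamp in seconds. All numbers are made up.

```python
import heapq
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def top_frames(frame_features, weights, bias, k=3):
    """Timestamps of the k frames with the highest classifier score.

    frame_features maps timestamp (seconds) -> bottleneck feature vector.
    """
    scored = (
        (sigmoid(sum(w * x for w, x in zip(weights, feats)) + bias), ts)
        for ts, feats in frame_features.items()
    )
    return [ts for _, ts in heapq.nlargest(k, scored)]


# Hypothetical candidates from the pre-filter, with toy 2-D features.
candidates = {10.0: [0.1, 0.0], 42.5: [2.0, 1.0], 73.0: [1.0, 0.5]}
print(top_frames(candidates, weights=[1.0, 1.0], bias=0.0, k=2))  # [42.5, 73.0]
```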

What’s Next

● Refinements:

○ Fine-tuning earlier layers

○ Other models: ResNet-v2, Xception

○ Pre-filtering: adaptive, hardware acceleration

● Products:

○ New auto thumbnails

○ Thumbstrips

Resources

Blog Posts:

● https://research.googleblog.com/2016/03/train-your-own-image-classifier-with.html

● https://github.com/tensorflow/models/tree/master/inception

● http://iamaaditya.github.io/2016/03/one-by-one-convolution/

● http://www.slideshare.net/ScottThompson90/applying-transfer-learning-in-tensorflow

● https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html

● http://cs231n.github.io/transfer-learning/

● https://research.googleblog.com/2015/10/improving-youtube-video-thumbnails-with.html

● https://pseudoprofound.wordpress.com/2016/08/28/notes-on-the-tensorflow-implementation-of-inception-v3/

● https://adeshpande3.github.io/adeshpande3.github.io/The-9-Deep-Learning-Papers-You-Need-To-Know-About.html

Papers:

● Rethinking the Inception Architecture for Computer Vision. https://arxiv.org/abs/1512.00567

● Xception: Deep Learning with Depthwise Separable Convolutions. https://arxiv.org/abs/1610.02357

● CNN Features off-the-shelf: an Astounding Baseline for Recognition. https://arxiv.org/abs/1403.6382

● DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. https://arxiv.org/abs/1310.1531

● How transferable are features in deep neural networks? https://arxiv.org/abs/1411.1792


