
  • Selecting Youtube Video Thumbnails via Convolutional Neural Networks

    Noah Arthurs

    Stanford [email protected]

    Sawyer Birnbaum

    Stanford [email protected]

    Nate Gruver

    Stanford [email protected]


    The success of a Youtube channel is driven in large part by the quality of the thumbnails chosen to represent each video. In this paper, we describe a CNN architecture for fitting the thumbnail qualities of successful videos and from there selecting the best thumbnail from the frames of a video. Accuracy on par with a human benchmark was achieved on the classification task, and the ultimate thumbnail selector picked what we deemed reasonable frames about 80% of the time. In-depth analysis of the classifier was also performed, and data augmentation was used to attempt improvements on the flaws we noticed. Video category information was also incorporated into a later model in an attempt to create more semantically fitting thumbnails. Ultimately, the success of augmentation and additional semantic information at selecting good frames did not differ much from earlier results but revealed promising qualitative structure in the selection task.

    1. Introduction

    Every YouTube video is represented by a thumbnail, a small image that, along with the title and channel, serves as the cover of the video. Thumbnails that are interesting and well-framed attract viewers, while those that are confusing and low-quality encourage viewers to click elsewhere. As a testament to the importance of a good thumbnail, 90% of the most successful YouTube videos have custom thumbnails [2]. YouTube uploaders without the time or skills to create a custom thumbnail, however, must pick one of 3 frames automatically chosen from the video. Our mission is to improve this frame selection process and help uploaders select high quality frames that will attract viewers to their channel.

    We use a two-phase process to select good thumbnails for a video. In the 1st phase, we train a convolutional neural network (CNN) to predict the quality of a video (encoded with a binary good/bad label) from the video's thumbnail.

    All authors contributed equally to this project

    In the next phase, we run a set of frames from a video through our model and use the softmax scores produced by the algorithm to rank their quality as thumbnails. We can then recommend the frames that achieve the highest scores.
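The ranking step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the trained classifier has already produced a raw (bad, good) score pair per frame, and simply converts each pair to P(good) via softmax and sorts.

```python
import numpy as np

def rank_frames(scores):
    """Rank frames by P(good), given raw (s0, s1) score pairs.

    scores: array of shape (num_frames, 2), columns = (bad, good).
    Returns frame indices sorted from best to worst.
    """
    # Numerically stable softmax over each score pair.
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    p_good = exp[:, 1] / exp.sum(axis=1)
    return np.argsort(-p_good)

# Hypothetical scores for 4 candidate frames.
scores = np.array([[2.0, 1.0], [0.0, 3.0], [1.0, 1.0], [-1.0, 1.5]])
print(rank_frames(scores))  # best frame first
```

Only the score difference s1 - s0 matters to the ranking, since softmax is shift-invariant within each pair.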

    2. Related Work

    Because of its importance to both content creators and hosts of video sharing platforms, thumbnail selection has become a major area of research in the last 10 years. Most early work in the field focused on thumbnail selection through more classical feature selection techniques [32] [15] [19]. The focus in these studies was primarily streamlining the thumbnail selection pipeline [9] [15] [31] [19] as well as selecting semantically relevant thumbnails through regression [32] or metrics of mutual information [20] [13]. Only recently have thumbnail selection systems begun to utilize convolutional neural networks. Here again we see three primary areas of focus: first, generation of thumbnails that are correlated with video success [28] [26]; second, generation of thumbnails for semantic relevance using a mixture of CNNs and NLP [27] [21] [29] [30]; and, third, thumbnail selection through measurable aesthetic qualities [24] [8].

    In this work we decided to focus on the first of these areas, crafting thumbnails with an eye towards general video success. This makes our work most like that of Weilong Yang and Min-hsuan Tsai [28] at Google DeepMind. To build on their accomplishments, we performed a thorough analysis of our model's inner workings and employed data augmentation to help the model focus on thumbnail quality rather than on video branding (i.e., logos). We also attempted to incorporate the semantic information encapsulated in the video's category.

    3. Dataset and Methods

    Our project consists of three parts:

    1. Create a dataset of good and bad thumbnails.

    2. Train a classifier to distinguish between these two categories.

    3. Select thumbnails by choosing from each video the frames that have the highest probability of being good thumbnails according to the classifier.

    3.1. Dataset

    For the 1st phase of our system, we gathered thumbnails from good and bad videos. We defined good and bad as a function of the number of views received by a video, under the assumption that a video with a high view count likely has a custom, well-designed thumbnail (as mentioned above, this is true for 90% of the most popular videos), while videos with extremely few views likely have unappealing thumbnails. Specifically, we labeled thumbnails from videos with more than 1 million views good and those with fewer than 100 views bad.

    To collect the good thumbnails, we downloaded (at most) 5 videos with a million or more views from the 2,500 most subscribed YouTube channels [4]1. This method further ensures that the good thumbnails we selected are custom images designed by experienced YouTube content creators. To select bad thumbnails, we considered videos selected by a pseudorandom algorithm and included those with fewer than 100 views (which was about half of the total) [1].
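The labeling rule described above can be written down directly. A small sketch, assuming view counts are already available for each candidate video; videos in the excluded middle band are simply dropped.

```python
def label_video(view_count):
    """Label per the paper's thresholds: >= 1 million views -> "good",
    fewer than 100 views -> "bad"; anything in between is excluded."""
    if view_count >= 1_000_000:
        return "good"
    if view_count < 100:
        return "bad"
    return None  # not used in the dataset

# Hypothetical view counts.
print([label_video(v) for v in (5_000_000, 50, 10_000)])
```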

    This process provided us with 5,000 videos of each class. In general, as expected, the good-class thumbnails are noticeably higher quality than the bad-class ones, although there is a fair amount of noise in the data. We set aside 10% of the data for the test set and 20% of the data for the validation set.

    Example thumbnails: good on the left and bad on the right

    Both the good and bad thumbnails come from a diverse set of YouTube categories. The distribution of thumbnails, however, differs between the classes. Here are the most common categories for both classes (with associated frequency counts):

    Rank  Good                     Bad
    1     Music (1335)             Entertainment (2656)
    2     Howto & Style (1048)     People & Blogs (807)
    3     News & Politics (672)    Howto & Style (292)

    1 We downloaded YouTube video data (thumbnails, frames, etc.) using youtube-dl and the YouTube Developer API [6] [5]

    While these distributional differences are a potential source of concern (we do not want the model to make label predictions based on a thumbnail's category without considering its quality), the amount of diversity within each category and the similarities between some of the top categories, such as Music and Entertainment, force the model to evaluate images on more than a categorical level. In any event, our results suggest that the model is discriminating on more than just the image category. (See 4.2)

    After downloading the thumbnails, we scale them to YouTube's recommended 16:9 aspect ratio, cropping images if they are initially too tall and adding a black border if they are initially too wide. (This corresponds to how YouTube handles mis-sized thumbnails.) The images are then resized to 80x45 pixels using Lanczos resampling [10] to reduce the size of the model and allow for efficient training. Image resizing was performed using SciPy [16]. Lastly, we zero-centered and normalized the images.
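The crop/pad and normalization steps can be sketched as below. This is an illustrative reimplementation under stated assumptions (center-crop for too-tall images, symmetric black bars for too-wide ones, per-image zero-centering); the Lanczos resize itself, done with SciPy in the paper, is omitted here.

```python
import numpy as np

def to_16_9(img):
    """Force a (height, width, 3) image to a 16:9 aspect ratio:
    center-crop rows if too tall, pad rows with black if too wide."""
    h, w, _ = img.shape
    if h * 16 > w * 9:                   # too tall: crop rows
        target_h = (w * 9) // 16
        top = (h - target_h) // 2
        img = img[top:top + target_h]
    elif h * 16 < w * 9:                 # too wide: add black bars
        target_h = -(-w * 9 // 16)       # ceiling division
        pad = target_h - h
        img = np.pad(img, ((pad // 2, pad - pad // 2), (0, 0), (0, 0)))
    return img

def normalize(img):
    """Zero-center and scale to unit variance (per image)."""
    x = img.astype(np.float32)
    return (x - x.mean()) / (x.std() + 1e-8)
```

For example, a 100x160 image is cropped to 90x160, while an 80x160 image is padded to 90x160; both then satisfy h*16 == w*9.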

    For the 2nd phase of our system, we downloaded 1 frame per second from 84 videos across the 9 most popular YouTube categories.2 We resized each frame to the same resolution as the thumbnail data and created a set of 10 frames per video. We spaced the frames out evenly for each video to capture a somewhat representative sample of the scenes in the video and help ensure that frames could be easily differentiated.
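One simple way to space 10 frames evenly, consistent with the description above, is to take the midpoint of each of k equal segments of the video. A sketch under that assumption (the paper does not specify the exact spacing scheme):

```python
def pick_frames(num_frames, k=10):
    """Choose k indices spread evenly across a video's extracted
    frames (one frame per second in the paper): one index per
    equal-length segment, taken at the segment midpoint."""
    if num_frames <= k:
        return list(range(num_frames))
    step = num_frames / k
    return [int(i * step + step / 2) for i in range(k)]

# A hypothetical 5-minute video at 1 frame per second.
print(pick_frames(300))
```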

    3.2. Classifier

    Our classifiers are neural networks that take an image as input and output a two-dimensional vector of scores S = (s0, s1), where s0 is the score of the bad class and s1 is the score of the good class. We turn these scores into probabilities using the softmax function. Specifically, the probability that the image is good (according to our model) is:

    P(y = 1) = e^{s1} / (e^{s0} + e^{s1})

    where y refers to the true class of the image in question. Optimally, then, the model should assign a score of -inf to the incorrect class and a score of +inf to the correct class for each example. In order to measure accuracy on the classification task, we say that our model classifies an image as good if P(y = 1) > 0.5 and bad otherwise. Two of our group members tested ourselves on the classification task across 212 examples, and both achieved an accuracy of 81.6%, which we consider our human benchmark.
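The probability formula and the 0.5 decision threshold above translate directly to code. A minimal sketch (stable computation via subtracting the max score, which does not change the result):

```python
import math

def p_good(s0, s1):
    """P(y = 1) = e^{s1} / (e^{s0} + e^{s1}), computed stably."""
    m = max(s0, s1)
    e0, e1 = math.exp(s0 - m), math.exp(s1 - m)
    return e1 / (e0 + e1)

def classify(s0, s1):
    """Decision rule: good iff P(y = 1) > 0.5, i.e. iff s1 > s0."""
    return "good" if p_good(s0, s1) > 0.5 else "bad"

print(classify(0.2, 1.7))  # prints "good"
```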

    Our loss for example i of class yi is given by the softmax cross-entropy loss [12]:

    Li = -log(P(y = yi)) = -log( e^{s_yi} / (e^{s0} + e^{s1}) )

    2 Music, Comedy, Film & Entertainment, Gaming, Beauty & Fashion, Sports, Tech, Cooking & Health, and News & Politics


  Our loss for a batch of n data points consists of the average cross-entropy loss for that batch plus an L2 regularization term to penalize large weights and avoid overfitting [22]:

    L = (1/n) * sum_i Li + lambda * sum w^2

    where lambda is a regularization constant and the w's are the weights (not the biases) for the dense and convolutional layers. We used TensorFlow [7] to implement our models and the Adam optimizer [17] to minimize the loss function, so that we had an adaptive learning rate for each parameter.
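The batch loss can be sketched in NumPy as below. This is an illustration of the formula, not the paper's TensorFlow code; the regularization constant `lam` is a placeholder value.

```python
import numpy as np

def batch_loss(scores, labels, weights, lam=1e-4):
    """Average softmax cross-entropy over a batch plus an L2 penalty
    lam * sum(w^2) over the given weight matrices (biases excluded).

    scores: (n, 2) raw class scores; labels: (n,) integer classes.
    """
    # Stable log-softmax: shift by the row max before exponentiating.
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    n = scores.shape[0]
    ce = -log_probs[np.arange(n), labels].mean()
    l2 = lam * sum((w ** 2).sum() for w in weights)
    return ce + l2
```

With uniform scores, the cross-entropy term reduces to log(2) per example, which gives a quick sanity check.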

    Each network starts with a series of convolutional layers which downsample either by placing 2x2 max pooling layers between the convolutions or by performing convolutions with a stride of 2. After the last convolutional layer, the activations are flattened and then put through a series of dense layers, the last of which produces the two-dimensional output of scores.
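The downsampling pattern above fixes the spatial sizes flowing through the network. A small sketch of that arithmetic, assuming "same" padding and three downsampling stages (the actual number of stages and filter counts in our networks may differ):

```python
import math

def after_downsample(h, w):
    """Spatial size after a 2x2 max pool or stride-2 convolution
    with 'same' padding: each dimension halves, rounding up."""
    return math.ceil(h / 2), math.ceil(w / 2)

# Hypothetical walk-through for an 80x45 (width x height) input.
h, w = 45, 80
for _ in range(3):
    h, w = after_downsample(h, w)
print((h, w))  # spatial size fed to the flatten + dense layers
```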

    In addition to L2 regularization, we used dropout [25] to prevent overfitting in our network. We were able to train effectively without batch normalization [14], so we did not include it.
