Object Detection and Dense Captioning

You Only Look Once: Unified, Real-Time Object Detection. Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi, CVPR 2016

DenseCap: Fully Convolutional Localization Networks for Dense Captioning. Justin Johnson, Andrej Karpathy, Li Fei-Fei, CVPR 2016

Dana Berman and Guy Leibovitz

Faster R-CNN

- Region Proposal Network (RPN)

• Anchor boxes (x_a, y_a, w_a, h_a)

• Predict: k × (t_x, t_y, t_w, t_h)

x = x_a + t_x w_a

w = w_a exp(t_w)

- ROI pooling → classifier and bbox regression
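The anchor parameterization above can be sketched in a few lines of Python. The slide shows only the x and w equations; the y and h lines below assume the same form. The function name `decode_anchor` and the example numbers are illustrative, not taken from the slides.

```python
import math

def decode_anchor(anchor, offsets):
    """Apply predicted offsets (t_x, t_y, t_w, t_h) to an anchor (x_a, y_a, w_a, h_a)."""
    xa, ya, wa, ha = anchor
    tx, ty, tw, th = offsets
    x = xa + tx * wa          # centre shift, scaled by the anchor width
    y = ya + ty * ha          # assumed analogous to the x equation
    w = wa * math.exp(tw)     # log-space scaling keeps the width positive
    h = ha * math.exp(th)     # assumed analogous to the w equation
    return x, y, w, h

# Hypothetical anchor and offsets, chosen only to make the arithmetic easy to follow.
box = decode_anchor((100.0, 80.0, 40.0, 20.0), (0.5, -0.25, 0.0, 0.0))
```

With zero scale offsets the box keeps the anchor's size and only its centre moves.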

Faster R-CNN - Limitations

- Training:

• NIPS 2015: alternating optimization

• arXiv 2016: end-to-end (approximately)

- Not real-time: 0.2 sec/image


YOLO - Overview

- You only look once

- Trainable end-to-end

YOLO - Method: Input image

YOLO - Method: The image is split into a grid.

YOLO - Method: Each cell predicts boxes and confidences: P(object).

YOLO - Method: Each cell also predicts a class probability. The class probability is conditional: P(class | object).

[Figure: per-cell class map; example classes: Bicycle, Car, Dining Table]

YOLO - Method: Then we combine the box and class predictions.
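One way to read "combining" is as a product of the two predicted probabilities: the class-specific score of a box is P(class | object) × P(object). The sketch below assumes exactly that; the function name and the numbers are made up for illustration.

```python
def class_scores(p_object, p_class_given_object):
    """Class-specific confidence for one box: P(class | object) * P(object)."""
    return [p_object * p for p in p_class_given_object]

# Hypothetical cell: box confidence 0.5, three classes instead of twenty.
scores = class_scores(0.5, [0.25, 0.5, 0.25])
best_class = max(range(len(scores)), key=lambda c: scores[c])
```

Each box ends up with one score per class, which is what the later thresholding step operates on.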

YOLO - Method: Finally, we apply non-maximum suppression (NMS) and threshold the detections.
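A minimal sketch of greedy NMS with score thresholding, in plain Python. The box format (corner coordinates x1, y1, x2, y2), the threshold values, and the choice to threshold before suppression are assumptions made for this example; the slides do not specify them.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, score_thresh=0.2, iou_thresh=0.5):
    """Drop low-scoring boxes, then greedily keep the best box and suppress overlaps."""
    order = sorted((i for i, s in enumerate(scores) if s >= score_thresh),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep

# Hypothetical detections for a single class.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.1]
kept = nms(boxes, scores)
```

Here the second box overlaps the first heavily and is suppressed, and the last box falls below the score threshold.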

YOLO - Method: The output size is fixed.

Each cell predicts:

- B bounding boxes. For each bounding box:

• 4 coordinates (x, y, w, h)

• 1 confidence value P(object)

- N class probabilities P(class | object)

For Pascal VOC:

- 7 × 7 grid

- B = 2 bounding boxes / cell

- N = 20 classes

7 × 7 × (2 × 5 + 20) = 7 × 7 × 30 tensor = 1470 outputs

This parameterization fixes the output size.
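The arithmetic above can be checked with a short sketch. The ordering of values inside a cell's vector (boxes first, then class probabilities) is an assumption for illustration; the slides only give the sizes.

```python
def yolo_output_shape(S=7, B=2, N=20):
    """Per cell: B boxes x (4 coordinates + 1 confidence) plus N class probabilities."""
    return (S, S, B * 5 + N)

def split_cell(vec, B=2, N=20):
    """Split one cell's vector into B box records and N class probabilities.

    The layout (boxes first, classes last) is hypothetical.
    """
    boxes = [vec[5 * b: 5 * b + 5] for b in range(B)]
    class_probs = vec[5 * B: 5 * B + N]
    return boxes, class_probs

shape = yolo_output_shape()                 # Pascal VOC settings from the slide
total = shape[0] * shape[1] * shape[2]
cell_boxes, cell_classes = split_cell(list(range(30)))
```

Changing the grid size, B, or N changes the tensor shape but never makes it depend on the number of objects in the image.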

YOLO - Method: Neural network

YOLO - Method: Inspired by Inception from GoogLeNet

YOLO - Method: Inception module (CVPR 2015)

YOLO - Method: One neural network is trained to be the whole detection pipeline

YOLO - Method: Training

- Pre-training conv. layers on ImageNet, using low-res input (1 week)

- For detection: add layers, increase image resolution

- Normalize bounding box coordinates to [0, 1]

- Data augmentation: random scale, translation, exposure and saturation

- Loss function: L2
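To make the normalization step concrete, here is a hedged sketch of how a ground-truth box could be mapped to targets in [0, 1]. It follows the convention described in the YOLO paper (centre offset relative to the grid cell, width and height relative to the image); the function name and the example values are illustrative.

```python
def encode_box(cx, cy, w, h, img_w, img_h, S=7):
    """Map a box (centre cx, cy and size w, h, in pixels) to targets in [0, 1]."""
    # Grid cell that contains the box centre.
    col = min(int(cx / img_w * S), S - 1)
    row = min(int(cy / img_h * S), S - 1)
    # Centre offset inside that cell; size relative to the whole image.
    x = cx / img_w * S - col
    y = cy / img_h * S - row
    return row, col, (x, y, w / img_w, h / img_h)

# Hypothetical box in a 448 x 448 image.
row, col, target = encode_box(224.0, 112.0, 112.0, 56.0, img_w=448, img_h=448)
```

The cell indices say which cell is responsible for the object, and the four target values all lie in [0, 1].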

YOLO - Method: Loss function
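For reference, the multi-part sum-squared (L2) loss from the YOLO paper can be written as below. The notation comes from the paper rather than from these slides: S is the grid size, B the number of boxes per cell, C the box confidence, p(c) the class probabilities, and the indicator selects the box j in cell i that is responsible for an object. The paper sets λ_coord = 5 and λ_noobj = 0.5.

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
    + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}}
    \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```

The square roots on w and h reduce the penalty for a given absolute error on large boxes relative to small ones, and the two λ weights rebalance localization error against the many cells that contain no object.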

YOLO - Framework: Darknet - open source neural networks in C

http://pjreddie.com/darknet/

YOLO - Results: YOLO works across a variety of natural images.

YOLO - Results: It also generalizes well to new domains (such as art).

YOLO - Results: Quantitative detection and localization results

YOLO - Results: Limitations

- Small objects

- Unusual aspect ratios

- Multiple objects per grid cell

Beyond YOLO

SSD: Single Shot MultiBox Detector

- ECCV 2016

- More accurate than Faster R-CNN

YOLO9000: Better, Faster, Stronger

- arXiv, 25 Dec 2016

- 9000 object classes

Lessons from SSD and YOLO9000

- Multi-scale feature maps

- Predict anchor box offsets

• Normalized

• h ∼ h_a exp(t)

• Aspect ratios

- Data augmentation (scale, brightness, etc.)

Dense Captioning

Background: Detection and Captioning

[Figure: Computer Vision Tasks]

Background: Visual Genome Dataset

108,077 images, 5,408,689 regions + captions

Krishna et al., "Visual Genome", 2016

Example region captions:

- A boy wearing
- A red tricycle
- A red flying frisbee
- Two men playing frisbee
- Wooden privacy fence
- The ground is made of stone
- The legs of a man
- An athletic shoe on a foot

Questions?

DenseCap: Fully Convolutional Localization Networks for Dense Captioning

Justin Johnson*, Andrej Karpathy*, Li Fei-Fei

Stanford University

Poster sections: Abstract; Fully Convolutional Localization and Captioning Architecture; Region Search by Text Query; Dense Captioning Results; Quantitative Evaluation

Abstract

Task. We introduce the dense captioning task, which requires a computer vision system to both localize and describe salient regions in images in natural language. The dense captioning task generalizes object detection when the descriptions consist of a single word, and image captioning when one predicted region covers the full image.

Model. To address the localization and description task jointly we propose a Fully Convolutional Localization Network (FCLN) architecture that processes an image with a single, efficient forward pass, requires no external region proposals, and can be trained end-to-end with a single round of optimization. The architecture is composed of a Convolutional Network, a novel dense localization layer, and a Recurrent Neural Network language model that generates the label sequences.

Experiments. We evaluate our network on the Visual Genome dataset, which comprises 94,000 images and 4,100,000 region-grounded captions. We observe both speed and accuracy improvements over baselines based on current state-of-the-art approaches in both generation and retrieval settings.

Dense Captioning task

[Figure: tasks arranged by label density (whole image vs. image regions) and label complexity. Whole-image caption: "A cat riding a skateboard". Dense captioning: "Orange spotted cat", "Skateboard with red wheels", "Cat riding a skateboard", "Brown hardwood flooring".]

Model Description (broken down)

Image: 3 x W x H → Conv features: C x W' x H' → Localization Layer → Region features: B x C x X x Y → Recognition Network → Region Codes: B x D → LSTM → captions (e.g. "Striped gray cat", "Cats watching TV")

Localization Layer

- Input: Conv features: C x W' x H'
- Region Proposals: 4k x W' x H'
- Region scores: k x W' x H'
- Best Proposals: B x 4
- Sampling Grid Generator → Sampling Grid: B x X x Y x 2
- Bilinear Sampler → Region features: B x 512 x 7 x 7

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, Ren et al., NIPS 2015

Spatial Transformer Networks, Jaderberg et al., NIPS 2015
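The step from best proposals (B x 4) to a sampling grid (B x X x Y x 2) can be illustrated with a small sketch. It assumes boxes are given as (x_c, y_c, w, h) in feature-map coordinates and that sample points sit at the centres of an evenly spaced X x Y subdivision of each box; both choices are assumptions for this example and not necessarily what the DenseCap code does.

```python
def sampling_grid(box, X=7, Y=7):
    """Return an X x Y nested list of (x, y) sample points covering one box.

    box = (xc, yc, w, h); the even, cell-centred spacing is an illustrative choice.
    """
    xc, yc, w, h = box
    xs = [xc + ((i + 0.5) / X - 0.5) * w for i in range(X)]
    ys = [yc + ((j + 0.5) / Y - 0.5) * h for j in range(Y)]
    return [[(x, y) for y in ys] for x in xs]

# Hypothetical best proposals (B = 2); stacking the grids gives B x X x Y x 2.
proposals = [(10.0, 20.0, 7.0, 14.0), (5.0, 5.0, 3.0, 3.0)]
grids = [sampling_grid(b) for b in proposals]
```

Because the grid has a fixed size regardless of the box size, every region yields features of the same shape, which is what lets the recognition network process all B regions in one batch.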

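Region Search by Text Query

[Figure: regions retrieved for the queries "white tennis shoes", "head of a giraffe", "red and white sign", "hands holding a phone", "front wheel of a bus"]

Dense Captioning Results

[Figure: output of our model compared with a full-image RNN. Full-image RNN captions: "A man and a woman sitting at a table with a cake.", "A train is traveling down the tracks near a forest.", "A large jetliner flying through a blue sky.", "A teddy bear with a red bow on it."]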

Image Retrieval by Bag of Text Queries

- Better performance (5.39 vs. 4.26 mAP)
- 13× faster
- Runs at 4-20 fps

Open Source Code Release

Find the code on GitHub: https://github.com/jcjohnson/densecap

- A pretrained model
- Code to run the model on new images, on either CPU or GPU
- Code to run a live demo with a webcam
- Evaluation code for dense captioning
- Instructions for training the model

Stop by poster #2!

where $\Phi_x$ and $\Phi_y$ are the parameters of a generic sampling kernel $k()$ which defines the image interpolation (e.g. bilinear), $U^c_{nm}$ is the value at location $(n, m)$ in channel $c$ of the input, and $V^c_i$ is the output value for pixel $i$ at location $(x^t_i, y^t_i)$ in channel $c$. Note that the sampling is done identically for each channel of the input, so every channel is transformed in an identical way (this preserves spatial consistency between channels).

In theory, any sampling kernel can be used, as long as (sub-)gradients can be defined with respect to $x^s_i$ and $y^s_i$. For example, using the integer sampling kernel reduces (3) to

$$V^c_i = \sum_n^H \sum_m^W U^c_{nm}\, \delta(\lfloor x^s_i + 0.5 \rfloor - m)\, \delta(\lfloor y^s_i + 0.5 \rfloor - n) \tag{4}$$

where $\lfloor x + 0.5 \rfloor$ rounds $x$ to the nearest integer and $\delta()$ is the Kronecker delta function. This sampling kernel equates to just copying the value at the nearest pixel to $(x^s_i, y^s_i)$ to the output location $(x^t_i, y^t_i)$. Alternatively, a bilinear sampling kernel can be used, giving

$$V^c_i = \sum_n^H \sum_m^W U^c_{nm} \max(0, 1 - |x^s_i - m|) \max(0, 1 - |y^s_i - n|) \tag{5}$$
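The two kernels can be sketched in plain Python for a single channel. Here `U` is a nested list indexed as `U[n][m]` (row n, column m), matching the index order in the equations; the function names are illustrative. The bilinear version visits only the four pixels around the sample point, which gives the same result as the full double sum because every other weight is zero.

```python
import math

def bilinear_sample(U, x, y):
    """Bilinear kernel for one channel and one sample point (x, y)."""
    H, W = len(U), len(U[0])
    value = 0.0
    # Only pixels within distance 1 of (x, y) have a non-zero weight.
    for n in (math.floor(y), math.floor(y) + 1):
        for m in (math.floor(x), math.floor(x) + 1):
            if 0 <= n < H and 0 <= m < W:
                value += U[n][m] * max(0.0, 1 - abs(x - m)) * max(0.0, 1 - abs(y - n))
    return value

def nearest_sample(U, x, y):
    """Integer kernel: copy the pixel nearest to (x, y); zero outside the input."""
    H, W = len(U), len(U[0])
    m, n = math.floor(x + 0.5), math.floor(y + 0.5)
    return U[n][m] if 0 <= n < H and 0 <= m < W else 0.0

# Tiny hypothetical 2 x 2 feature map.
U = [[0.0, 1.0],
     [2.0, 3.0]]
center = bilinear_sample(U, 0.5, 0.5)    # equal weight on all four pixels
on_pixel = bilinear_sample(U, 1.0, 0.0)  # exactly on pixel (n=0, m=1)
nearest = nearest_sample(U, 0.6, 0.4)    # rounds to pixel (n=0, m=1)
```

Sampling exactly on a pixel returns that pixel's value, and sampling between pixels returns a distance-weighted average.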

To allow backpropagation of the loss through this sampling mechanism we can define the gradients with respect to $U$ and $G$. For bilinear sampling (5) the partial derivatives are

$$\frac{\partial V^c_i}{\partial U^c_{nm}} = \sum_n^H \sum_m^W \max(0, 1 - |x^s_i - m|) \max(0, 1 - |y^s_i - n|) \tag{6}$$

$$\frac{\partial V^c_i}{\partial x^s_i} = \sum_n^H \sum_m^W U^c_{nm} \max(0, 1 - |y^s_i - n|)
\begin{cases}
0 & \text{if } |m - x^s_i| \ge 1 \\
1 & \text{if } m \ge x^s_i \\
-1 & \text{if } m < x^s_i
\end{cases} \tag{7}$$

and similarly to (7) for $\partial V^c_i / \partial y^s_i$.

This gives us a (sub-)differentiable sampling mechanism, allowing loss gradients to flow back not only to the input feature map (6), but also to the sampling grid coordinates (7), and therefore back to the transformation parameters $\theta$ and localisation network since $\partial x^s_i / \partial \theta$ and $\partial y^s_i / \partial \theta$ can be easily derived from (10) for example. Due to discontinuities in the sampling functions, sub-gradients must be used. This sampling mechanism can be implemented very efficiently on GPU, by ignoring the sum over all input locations and instead just looking at the kernel support region for each output pixel.
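As a sanity check on the coordinate gradient, the sketch below evaluates the piecewise expression for the derivative with respect to x on a small single-channel input and compares it with a central finite difference of the bilinear sample. The function names, the test input, and the step size are illustrative assumptions.

```python
def bilinear(U, x, y):
    """Full double-sum bilinear sample for one channel, U[n][m] = row n, column m."""
    return sum(u * max(0.0, 1 - abs(x - m)) * max(0.0, 1 - abs(y - n))
               for n, row in enumerate(U) for m, u in enumerate(row))

def dV_dx(U, x, y):
    """Sub-gradient of the bilinear sample with respect to the x coordinate."""
    total = 0.0
    for n, row in enumerate(U):
        wy = max(0.0, 1 - abs(y - n))
        for m, u in enumerate(row):
            if abs(m - x) >= 1:
                g = 0.0        # pixel outside the kernel support
            elif m >= x:
                g = 1.0        # moving right increases this pixel's weight
            else:
                g = -1.0       # moving right decreases this pixel's weight
            total += u * wy * g
    return total

U = [[0.0, 1.0],
     [2.0, 3.0]]
analytic = dV_dx(U, 0.5, 0.5)
eps = 1e-6
numeric = (bilinear(U, 0.5 + eps, 0.5) - bilinear(U, 0.5 - eps, 0.5)) / (2 * eps)
```

Away from integer coordinates the two values agree; at integer coordinates the kernel has a kink, which is why sub-gradients are needed.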

3.4 Spatial Transformer Networks

The combination of the localisation network, grid generator, and sampler form a spatial transformer (Fig. 2). This is a self-contained module which can be dropped into a CNN architecture at any point, and in any number, giving rise to spatial transformer networks. This module is computationally very fast and does not impair the training speed, causing very little time overhead when used naively, and even speedups in attentive models due to subsequent downsampling that can be applied to the output of the transformer.

Placing spatial transformers within a CNN allows the network to learn how to actively transform the feature maps to help minimise the overall cost function of the network during training. The knowledge of how to transform each training sample is compressed and cached in the weights of the localisation network (and also the weights of the layers previous to a spatial transformer) during training. For some tasks, it may also be useful to feed the output of the localisation network, $\theta$, forward to the rest of the network, as it explicitly encodes the transformation, and hence the pose, of a region or object.

It is also possible to use spatial transformers to downsample or oversample a feature map, as one can define the output dimensions $H'$ and $W'$ to be different to the input dimensions $H$ and $W$. However, with sampling kernels with a fixed, small spatial support (such as the bilinear kernel), downsampling with a spatial transformer can cause aliasing effects.


Losses

Dense Captioning Architecture: Convolutional Network → Localization Layer → Recognition Network → Recurrent Network

Joint training: minimize five losses

1. Box regression (position), localization layer
2. Box classification (confidence), localization layer
3. Box regression (position), recognition network
4. Box classification (confidence), recognition network
5. Captioning
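Joint training over the five terms can be sketched as a weighted sum. The term names and the default weight of 1.0 are placeholders chosen for this example; the slides do not state how the terms are weighted.

```python
# Hypothetical names for the five loss terms listed above.
LOSS_NAMES = (
    "loc_box_regression",
    "loc_box_classification",
    "rec_box_regression",
    "rec_box_classification",
    "captioning",
)

def joint_loss(losses, weights=None):
    """Weighted sum of the five terms; any weight not given defaults to 1.0."""
    weights = weights or {}
    return sum(weights.get(name, 1.0) * losses[name] for name in LOSS_NAMES)

# Dummy per-term values, just to show the call.
example = {name: 1.0 for name in LOSS_NAMES}
total = joint_loss(example)
reweighted = joint_loss(example, {"captioning": 2.0})
```

Since all five terms feed a single scalar, one backward pass updates the convolutional network, the localization layer, the recognition network, and the language model together.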

Dense Captioning: Prior Work

Karpathy and Fei-Fei, CVPR 2015

Region Proposals → Convolutional Network → Recurrent Network

Captioning RNN inputs and outputs:

- Input: START man throwing disc → Output: man throwing disc END
- Input: START red frisbee → Output: red frisbee END
- Input: START gray stone ground → Output: gray stone ground END
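The START/END convention shown for the captioning RNN amounts to shifting the caption by one token: the input sequence begins with START and the target sequence ends with END. A minimal sketch, with whitespace tokenization as a simplifying assumption:

```python
START, END = "START", "END"

def make_rnn_pair(caption):
    """Build (input, target) token sequences for next-word prediction."""
    tokens = caption.split()              # whitespace tokenization, for illustration
    return [START] + tokens, tokens + [END]

inputs, targets = make_rnn_pair("man throwing disc")
```

At each step the network sees `inputs[t]` and is trained to predict `targets[t]`, so the final step learns to emit END.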



Example region captions:

- black computer monitor
- man wearing a blue shirt
- sitting on a chair
- people are in the background
- computer monitor on a desk
- silver handle on the wall
- man with black hair
- black bag on the floor
- red and brown chair
- wall is white

Additional Application - Finding Regions Given Description

Query: “head of a giraffe”

Query: “front wheel of a bus”