Reading Between The Lines: Object Localization Using Implicit Cues from Image Tags
Sung Ju Hwang and Kristen Grauman, University of Texas at Austin
Jingnan Li, Ievgeniia Gutenko




Page 1

Reading Between The Lines: Object Localization Using Implicit Cues from Image Tags
Sung Ju Hwang and Kristen Grauman
University of Texas at Austin
Jingnan Li
Ievgeniia Gutenko

Page 2

Baby, Infant, Kid, Child, Headphones, Red, Cute, Laughing

Boy, Dog, Grass, Blue, Sky, Puppy, River, Stream, Sun, Colorado, Nikon

Page 3

Weakly labeled images

Lamp, Chair, Painting

Table Lamp, Chair

Baby, Table, Chair

Bicycle, Person

Page 4

Object detection approaches

• Prioritize search windows within the image, based on the learned distribution of tags, for speed.

• Combine the models based on both tags and images, for accuracy.

Sliding window object detector

Need to reduce the # of windows scanned

Appearance-based detector

Page 5

Motivation

Idea: what can be predicted about the image, before even looking at it, from the given tags alone? Both sets of tags suggest that a mug appears in the image; but since a tag list reflects what "catches the eye" first, the area the object detector has to search can be narrowed.

Page 6

Implicit Tag Feature Definitions

• What implicit features can be obtained from tags?

• Relative prominence of each object, based on its order in the list.

• Scale cues implied by unnamed objects.

• The rough layout and proximity between objects, based on the sequence in which tags are given.

Page 7

Implicit Tag Feature Definitions

• Word presence and absence: a bag-of-words representation

  W = [w_1, ..., w_N]

• w_i denotes the number of times that tag-word i occurs in the image's associated keyword list, for a vocabulary of N total possible words.

• For most tag lists this vector will contain only binary entries, indicating whether each tag has been named or not.
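As a concrete illustration, the W feature can be sketched in a few lines of Python. The function name and the sample vocabulary below are hypothetical, not from the paper:

```python
def bag_of_words(tags, vocab):
    """W = [w_1, ..., w_N]: how often each vocabulary word appears in the tag list."""
    return [tags.count(word) for word in vocab]

# Hypothetical vocabulary and tag list for illustration only.
vocab = ["mug", "desk", "keyboard", "poster"]
W = bag_of_words(["mug", "desk", "mug"], vocab)
# W == [2, 1, 0, 0]
```

As the slide notes, real tag lists rarely repeat a word, so W is usually a binary presence vector.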

Page 8

Implicit Tag Feature Definitions

• Tag rank: the prominence of each object; certain things will be named before others.

  R = [r_1, ..., r_N]

• r_i denotes the percentile rank of tag-word i's position, relative to the ranks observed for that word in the training data (over the entire vocabulary).

• Some objects, such as a baby or a fire truck, have context-independent "noticeability" and are often named first regardless of their scale or position.
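A rough sketch of how such a percentile rank could be computed, assuming the positions at which each word appeared in the training tag lists have been recorded in a dictionary (all names here are hypothetical):

```python
def percentile_rank(position, training_positions):
    """Fraction of training occurrences of a word named at this position or earlier."""
    return sum(1 for p in training_positions if p <= position) / len(training_positions)

def tag_rank_feature(tags, vocab, training_positions):
    # training_positions[word]: list of 1-based positions at which `word`
    # appeared across the training tag lists (hypothetical bookkeeping).
    R = []
    for word in vocab:
        if word in tags and training_positions.get(word):
            R.append(percentile_rank(tags.index(word) + 1, training_positions[word]))
        else:
            R.append(0.0)  # word absent from this image's tags (or never seen)
    return R
```

For example, a word tagged first that usually appears later in training lists gets a low percentile, signalling unusual prominence in this image.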

Page 9

Implicit Tag Feature Definitions

• Mutual tag proximity: a tagger will name the most prominent objects first, then move his or her eyes to other objects nearby.

  P = [1/p_{1,2}, 1/p_{1,3}, ..., 1/p_{1,N}, ..., 1/p_{2,3}, ..., 1/p_{N-1,N}]

• p_{i,j} denotes the (signed) rank difference between tag-words i and j for the given image.

• The entry is 0 when the pair is not present.
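The proximity feature might be computed as follows; `proximity_feature` and its argument names are hypothetical, and vocabulary pairs are enumerated in the i < j order given above:

```python
from itertools import combinations

def proximity_feature(tags, vocab):
    """P: reciprocal signed rank differences 1/p_ij for each vocabulary pair (i < j);
    the entry is 0 when the pair is not present in the tag list."""
    position = {w: k + 1 for k, w in enumerate(tags)}  # 1-based rank of each tag
    P = []
    for wi, wj in combinations(vocab, 2):
        if wi in position and wj in position:
            P.append(1.0 / (position[wj] - position[wi]))  # signed rank difference
        else:
            P.append(0.0)
    return P
```

Nearby tags give entries of large magnitude, and the sign records which word was named first.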

Page 10

Modeling the localization distributions

• Relate the defined tag-based features to object detection (alone or in combination).

• Let T = {W, R, P} denote the full set of tag-based features.

• Model the conditional probability density that a window X contains the object of interest, given only the image tags:

  P_O(X | T)

  where O is the target object category.

Page 11

Modeling the localization distributions

• Use a mixture of Gaussians model:

  P_O(X | T) = Σ_{i=1}^{m} α_i N(X; μ_i, Σ_i)

• The mixture parameters {α_i, μ_i, Σ_i} are obtained from a trained Mixture Density Network (MDN).

• Training: tagged images with ground-truth bounding boxes.

• Classification: a novel image with no bounding boxes, e.g. tagged "Computer, Bicycle, Chair". The MDN provides the mixture model representing the most likely locations for the target object.
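For intuition, evaluating such a mixture at a candidate location can be sketched as below. The paper's windows are three-dimensional (x, y, s), but a one-dimensional version keeps the sketch short; the function name is hypothetical:

```python
import math

def gmm_density(x, alphas, mus, sigmas):
    """Evaluate a 1-D Gaussian mixture: sum_i alpha_i * N(x; mu_i, sigma_i).
    Stands in for P_O(X|T), whose parameters the MDN would predict from tags."""
    return sum(
        a * math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for a, mu, s in zip(alphas, mus, sigmas)
    )
```

Sampling windows in proportion to this density is what produces ranked candidate locations like those shown on the next slide.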

Page 12

The top 30 most likely places for a car, sampled according to the modeled distribution based only on the images' tags.

Page 13

Modulating or Priming the detector

• Use P_O(X | T) from the previous step, and:

• Combine it with the predictions of an appearance-based object detector, P_O(X | A), where A denotes the appearance cues:

  HOG detector; part-based detector (deformable part model).

• Or use the tag-based model to rank sub-windows and run the detector on only the most probable locations ("priming").

• The decision value d(x, y, s) of the detector is mapped to a probability:

  P_O(X = (x, y, s) | A) = 1 / (1 + exp(-d(x, y, s)))
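The sigmoid mapping above is straightforward to write down (the function name is hypothetical):

```python
import math

def score_to_probability(d):
    """Map a raw detector decision value d(x, y, s) to a probability via a sigmoid."""
    return 1.0 / (1.0 + math.exp(-d))
```

A decision value of 0 maps to probability 0.5, and strongly positive scores approach 1, which lets appearance and tag cues be combined on a common probabilistic scale.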

Page 14

Modulating the detector

• Balance the appearance-based and tag-based predictions.

• Use all tag cues: W, R, and P.

• Learn the weights w using detection scores for true detections and for a number of randomly sampled background windows.

• A Gist descriptor can be added, to compare against global scene visual context.

• Goal: improve accuracy.
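A minimal sketch of such a learned linear combination, assuming the weights have already been learned as described above (all names are hypothetical, and the exact combination form in the paper may differ):

```python
def modulated_probability(p_appearance, p_tag_cues, weights):
    """Weighted combination of the appearance-based probability and the
    tag-based probabilities (e.g. from the W, R, P cues). The weights would
    be learned from true detections vs. sampled background windows."""
    terms = [p_appearance] + list(p_tag_cues)
    return sum(w * p for w, p in zip(weights, terms))
```

Candidate windows are then re-ranked by this modulated score instead of the raw detector output.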

Page 15

Priming the detector

• Prioritize the search windows according to P_O(X | T).

• Assume the object is present, so that only the localization parameters (x, y, s) have to be estimated.

• Stop the search once a confident detection is found (probability > 0.5).

• Goal: improve efficiency.
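The priming loop can be sketched as follows, assuming `tag_prior` scores a window under the tag-based model P_O(X | T) and `detector` returns an appearance-based probability (both callables, and the function name, are hypothetical):

```python
def primed_search(windows, tag_prior, detector, threshold=0.5):
    """Scan candidate windows in decreasing order of the tag-based prior;
    stop at the first detection whose probability exceeds the threshold."""
    ranked = sorted(windows, key=tag_prior, reverse=True)
    for scanned, window in enumerate(ranked, start=1):
        if detector(window) > threshold:
            return window, scanned  # detection plus number of windows scanned
    return None, len(ranked)
```

When the tag prior ranks the true location near the top, the expensive detector runs on only a fraction of the windows a sliding-window scan would visit.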

Page 16

Results

• Datasets:

  • LabelMe: uses the HOG detector.

  • PASCAL: uses the part-based detector.

Note: the last three columns show the ranges of positions/scales present in the images, averaged per class, as a percentage of image size.


Page 17

LabelMe Dataset

• Priming Object Search: Increasing Speed
  For a detection rate of 0.6, the proposed method scans only about one third of the windows scanned by the sliding window approach.

• Modulating the Detector: Increasing Accuracy
  The proposed features make noticeable improvements in accuracy over the raw detector.

Page 18

Example detections on LabelMe

• Each image shows the best detection found.

• Scores denote overlap ratio with ground truth.

• The detectors modulated according to the visual or tag-based context are more accurate.

Page 19

PASCAL Dataset

• Priming Object Search: Increasing Speed
  Adopts the Latent SVM (LSVM) part-based windowed detector, which is faster here than the HOG detector was on LabelMe.

• Modulating the Detector: Increasing Accuracy
  Augmenting the LSVM detector with the tag features noticeably improves accuracy, increasing the average precision by 9.2% overall.

Page 20

Example detections on PASCAL VOC

• Red dotted boxes denote the most confident detections according to the raw detector (LSVM).

• Green solid boxes denote the most confident detections when modulated by the proposed method (LSVM + tags).

• The first two rows show good results; the third row shows failure cases.

Page 21

Conclusions

• A novel approach that uses the information "between the lines" of tags.

• Utilizing this implicit tag information helps to make search both faster and more accurate.

• The method complements, and even exceeds, the performance of methods using visual cues.

• Shows potential for learning the tendencies of real taggers.

Page 22

Thank you!