
Effective Dataset Construction in Computer Vision

Kota Yamaguchi

Recent progress in image recognition

ILSVRC image classification task: given an image (e.g., of a steel drum), choose among candidate labels such as scale, T-shirt, steel drum, drumstick, mud turtle.

ILSVRC classification error, 2010-2014 [Russakovsky 2014]:
• NEC 28.2%
• XRCE 25.8%
• SuperVision 16.4%
• Clarifai 11.7%
• GoogLeNet 6.7%
• Ioffe et al. (arXiv) 4.9%
• Human 5.1%

Deep models and ...

Excerpt from X. Zhu et al., "Do We Need More Training Data?", IJCV 2015:

Fig. 1: The best reported performance on the PASCAL VOC challenge has shown marked increases since 2006 (top). This could be due to various factors: the dataset itself has evolved over time, the best-performing methods differ across years, etc. In the bottom row, we plot a particular factor, training data size, which appears to correlate well with performance. This begs the question: has the increase been largely driven by the availability of larger training sets?

Fig. 2: We plot idealized curves of performance versus training dataset size and model complexity. The effect of additional training examples is diminished as the training dataset grows (left), while we expect performance to grow with model complexity up to a point, after which an overly flexible model overfits the training dataset (right). Both of these notions can be made precise with learning theory bounds, see e.g. (McAllester 1999).

1.1 Challenges

We found there is a surprising amount of subtlety in scaling up training data sets in current systems. For a fixed model, one would expect performance to generally increase with the amount of data and eventually saturate (Fig. 2). Empirically, we often saw the bizarre result that off-the-shelf implementations show decreased performance with additional data! One would also expect that to take advantage of additional training data, it is necessary to grow the model complexity, in this case by adding mixture components to capture different object sub-categories and viewpoints. However, even with non-parametric models that grow with the amount of training data, we quickly encountered diminishing returns in performance with only modest amounts of training data.

We show that the apparent performance ceiling is not a consequence of HOG+linear classifiers. We provide an analysis of the popular deformable part model (DPM), showing that it can be viewed as an efficient way to implicitly encode and score an exponentially large set of rigid mixture components with shared parameters. With the appropriate sharing, DPMs produce substantial performance gains over standard non-parametric mixture models. However, DPMs have fixed complexity and still saturate in performance with current amounts of training data, even when scaled to mixtures of DPMs. This difficulty is further exacerbated by the computational demands of non-parametric mixture models, which can be impractical for many applications.

1.2 Proposed Solutions

In this paper, we offer explanations and solutions for many of these difficulties. First, we found it crucial to set model regularization as a function of the training dataset using cross-validation, a standard technique which is often overlooked in current object detection systems. Second, existing strategies for discovering sub-category structure, such as clustering aspect ratios (Felzenszwalb et al. 2010), appearance features (Divvala et al. 2012), and keypoint labels (Bourdev and Malik 2009), may not suffice. We found this was related to the inability of classifiers to deal with "polluted" data when mixture labels were improperly assigned. Increasing model complexity is thus only useful when mixture components capture the "right" sub-category structure.

To efficiently take advantage of additional training data, we introduce a non-parametric extension of a DPM which we call an exemplar deformable part model (EDPM). Notably, EDPMs increase the expressive power of DPMs with only a negligible increase in computation, making them practically useful. We provide evidence that suggests that compositional representations of mixture templates provide an effective way to help target the "long tail" of object appearances by sharing local part appearance parameters across templates.

Extrapolating beyond our experiments, we see the striking difference between classic mixture models and the non-parametric compositional model (both mixtures of linear classifiers operating on the same feature space) as evidence that the greatest gains in the near future will not be had with simple models + bigger data, but rather through improved representations and learning algorithms.

We introduce our large-scale dataset in Sect. 2, describe our non-parametric mixture models in Sect. 3, present extensive experimental results in Sect. 4, and conclude with a discussion in Sect. 5 including related work.

X. Zhu et al., Do We Need More Training Data?, IJCV 2015

PASCAL VOC best performance

Data improvement? Model improvement? We need both.

Data drives statistical models

Diagram: training data -> model (training); model + testing data -> results (testing); e.g., a CNN.

Object recognition datasets (timeline, roughly 2004-2014, building on WordNet, 1985-):
Caltech 101, MSRC, ESP Game, LabelMe, UIUC, Caltech 256, TinyImage, PASCAL VOC, ImageNet, Stanford Background, SUN, SBU1M, MS COCO, YFCC100M

ImageNet [Deng 2009]
• WordNet hierarchy
• 14M images, 21K synsets as of Apr 2015
• Used in ILSVRC

Microsoft COCO [Lin 2014]
• Over 300K images, 2M instances
• Creative Commons
• Segmentation
• 5 captions / image

Quality vs. scale (chart: dataset scale from ~100 to ~10B images vs. annotation quality)
• In-house datasets: highest quality, smallest scale
• Crowd-sourced datasets: MS COCO, ImageNet, SUN, Caltech 101
• Raw, user-generated data: SBU1M, YFCC100M (largest scale, lowest quality)

Motivation

• Data is driving image recognition, but creating a good dataset is not easy

• Big-data challenges – Scalability – Quality

• How should we construct a dataset?

Agenda

Part I: dataset construction
• Dataset construction
• Collecting data
• Annotating data
• Crowdsourcing

(10-min break)

Part II: case studies
• Data-driven clothing parsing
• Popularity analysis
• Studying fashion styles
• Studying fashion trends

Dataset construction

1. Decide the task: classification, detection, segmentation, ...
2. Collect data: Web, fieldwork
3. Select and annotate data: crowdsourcing
4. Your dataset is ready for use

To annotate, or not to annotate?

• If your task is a supervised approach or benchmarking, you need a lot of annotations.
• If your task is data mining or a weakly supervised approach, you need only minor annotations.

Supervised scenario: classification

• Image-label pairs: every picture must be completely annotated
• Very clean, e.g., the CIFAR-10 dataset [Krizhevsky 2009]

$D = \{(x, y)\}$, where $y \in \{\text{bird}, \text{cat}, \text{dog}, \ldots\}$
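As a minimal illustration of such a dataset, the sketch below builds image-label pairs from a hypothetical directory layout where each sub-folder name is the class label; the folder structure and helper name are assumptions for illustration, not how CIFAR-10 itself is distributed.

```python
from pathlib import Path

def load_image_label_pairs(root="image_data"):
    """Build D = {(x, y)} from a folder layout like root/<label>/<image>.png.

    The directory layout is hypothetical; CIFAR-10 ships as binary batches,
    so this only illustrates the (image, label) structure of supervised data.
    """
    dataset = []
    for label_dir in Path(root).iterdir():
        if not label_dir.is_dir():
            continue
        label = label_dir.name              # e.g., "bird", "cat", "dog"
        for image_path in label_dir.glob("*.png"):
            dataset.append((image_path, label))
    return dataset

# D = load_image_label_pairs()  ->  [(Path("image_data/bird/0001.png"), "bird"), ...]
```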

Weakly-supervised scenario: tag prediction

• Image and user-attached tags
  – No annotation effort
  – Often missing tags
  – Not necessarily visual

$D = \{(x, z)\}$, where $z \in \{\text{canon}, \text{USA}, \text{vintage}, \ldots\}$

Example tags:

Boulder, Colorado, city, historic, history, America, United States, urban, street, vintage, historical, ephemeral, classic, retro, brick, sign, signage, tavern, restaurant, cafe, dining, building, nostalgic, nostalgia, old, wall, door, window, Canon, architecture, Southwest

https://www.flickr.com/photos/29069717@N02/16772466913/

Dataset construction

• The purpose of a dataset differs greatly depending on the goal

• Be ambitious!
  – ImageNet [Deng et al., 2009]: from WordNet to a visual ontology
  – Visipedia [Perona et al., 2010]: constructing a visual encyclopedia

COLLECTING VISUAL DATA

Collecting data

• Web • Fieldwork

Collecting data on the Web

• Approaches – Search Engine: Google, Bing – Web API: Flickr, SNS – Web scraping

• Considerations – Legal issues – Noise and distribution of online content – Storage

Keyword-based search

• Good for weakly-categorized images
• Issues: data-size limitation, variability

Query expansion [Deng 2009]: expand the query "whippet" with synonyms (whippet dog, whippet greyhound) and translations (lebrel, 惠比特犬).

Web API

• Web services sometimes provide a developer API
• Structured data, e.g., JSON, XML

www.flickr.com/services/developer
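As a rough sketch of this kind of API access, the snippet below queries the Flickr photo search endpoint for Creative Commons images; the API key is your own, and the exact parameter values (especially the license codes) should be verified against the current Flickr API documentation.

```python
import requests

FLICKR_REST = "https://api.flickr.com/services/rest/"

def search_cc_photos(api_key, query, per_page=100, page=1):
    """Call flickr.photos.search and return the list of photo records (JSON)."""
    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "text": query,                 # keyword query, e.g., "whippet"
        "license": "1,2,4,5",          # Creative Commons license codes (verify against docs)
        "extras": "tags,url_m",        # request tags and a medium-size image URL
        "per_page": per_page,
        "page": page,
        "format": "json",
        "nojsoncallback": 1,
    }
    resp = requests.get(FLICKR_REST, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["photos"]["photo"]

# photos = search_cc_photos("YOUR_API_KEY", "whippet")
# each record has fields such as "id", "tags", and "url_m"
```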

Web scraper

Components: an HTTP client fetches pages from the WWW, a parser extracts data and new URLs from each page, a URL queue feeds the client, and storage keeps the extracted data.
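A minimal sketch of such a scraper, assuming the requests and BeautifulSoup libraries and a breadth-first URL queue; check each site's terms of service and robots.txt before running anything like this (see the legal notes below).

```python
import collections
import urllib.parse

import requests
from bs4 import BeautifulSoup

def crawl_images(seed_urls, max_pages=50):
    """Fetch pages breadth-first, store image URLs, and queue new links."""
    queue = collections.deque(seed_urls)        # URL queue
    seen = set(seed_urls)
    image_urls = []                             # "storage" for extracted data
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            page = requests.get(url, timeout=10)        # HTTP client
            page.raise_for_status()
        except requests.RequestException:
            continue
        soup = BeautifulSoup(page.text, "html.parser")  # parser
        for img in soup.find_all("img", src=True):
            image_urls.append(urllib.parse.urljoin(url, img["src"]))
        for a in soup.find_all("a", href=True):
            link = urllib.parse.urljoin(url, a["href"])
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return image_urls
```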

Legal issues

• CAREFULLY READ THE TERMS
  – Service providers don't like abusive access, especially if it harms their business (e.g., copying the entire website)
  – Users own the copyright on their own content
  – Talk to an expert if unsure

• Recommendations
  – Creative Commons content (Flickr)
  – Cite URLs instead of redistributing data

YFCC100M: Yahoo Flickr Creative Commons 100M [Thomee et al., 2015]
http://yahoolabs.tumblr.com/post/89783581601/one-hundred-million-creative-commons-flickr-images

• 100M Flickr photos
• 49M geo-tagged
• Image URLs
• Titles and descriptions
• Tags

Could be used as a basis on which to build a new dataset.

Quality of online data

Flickr description != caption

Vacation on the water One week vacation in the blue waters of Turkey was one of the best weeks in my life. On day in each bay just worrying about the sun and the water. One week without putting on shoes or using the phone. Paradise on earth!

https://www.flickr.com/photos/rspedro/8396863230/

Learning from online content

• User-generated content is not clean data
  – Non-visual texts / tags
  – Tags tend to have high precision, low recall
  – Frequency issues
• Hopefully, a large data size resolves these issues

Bigger data helps: retrieval from the SBU1M dataset [Ordonez 2011]

Im2Text: Describing Images Using 1 Million Captioned Photographs
Vicente Ordonez, Girish Kulkarni, Tamara L. Berg (Stony Brook University)

Our goal: automatically describe a query image (e.g., "The view from the 13th floor of an apartment building in Nakano awesome.").

Contributions:
• SBU Captioned Photo Dataset: a large novel dataset containing 1 million images from the web with associated captions written by people, filtered so that the descriptions are likely to refer to visual content. [http://tamaraberg.com/sbucaptions]
• A description generation method that utilizes global image representations to retrieve and transfer captions from our dataset to a query image.
• A description generation method that utilizes both global representations and direct estimates of image content (objects, actions, stuff, attributes, and scenes) to produce relevant image descriptions.

Method overview:
1. Match the query image against the 1 million captioned images using global image features (GIST + color) and transfer the caption(s) of the nearest neighbors (e.g., "The water is clear enough to see fish swimming around in it.").
2. Rerank the retrieved images using high-level content (captions, object detections, scene classification, stuff detections, people & actions) and transfer the caption of the best match (e.g., "The bridge over the lake on Suzhou Street.").

High-level information:
• Objects: detect 80 object categories using part-based deformable models and compute distances to objects detected in the query image based on visual attributes and raw visual descriptors.
• Stuff: detect stuff regions (water, etc.) using a sliding-window SVM scoring function with texton, color, and geometric features as input; similarity with the query image is the product of SVM probabilities.
• People/Actions: detect people and pose using state-of-the-art methods and compute person similarity using an attribute-based representation of pose.
• Scenes: train classifiers using global features for 26 common scene types and use the vector of classifier responses as a feature to compute similarity between images.
• TF-IDF: rank the words in the returned set of image captions using their term-frequency inverse-document-frequency scores, and follow a similar approach with the keywords for each object detection in the matching image set, yielding text-based and object-detection-based TF-IDF scores.

Dataset size: past work on image retrieval has shown that small collections often produce spurious matches. Increasing the dataset size (from 1,000 to 1,000,000 images) has a significant effect on the quality of retrieved global matches; the quantitative results reflect this (see the BLEU table below).

BLEU score evaluation:

Method                                          BLEU score
Global matching (1k)                            0.0774 +- 0.0059
Global matching (10k)                           0.0909 +- 0.0070
Global matching (100k)                          0.0917 +- 0.0101
Global matching (1 million)                     0.1177 +- 0.0099
Global + content matching (linear regression)   0.1215 +- 0.0071
Global + content matching (linear SVM)          0.1259 +- 0.0060

Human evaluation: in addition, we propose a new evaluation task where a user is presented with two photographs and one caption ("Please choose the image that better corresponds to the given caption") and must assign the caption to the most relevant image. For evaluation we use a query image, a random image, and a generated caption.

Caption used                     Success rate
Original human caption           96.0%
Top caption                      66.7%
Best from our top 4 captions     92.7%

[Figure: good results, e.g., "Amazing colours in the sky at sunset...", "Fresh fruit and vegetables at the market in Port Louis Mauritius."; and bad results with incorrect objects, incorrect context, or completely wrong captions, e.g., "Kentucky cows in a field.", "The cat in the window."]
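The core retrieval step of this poster (global matching) is essentially nearest-neighbor search in a global feature space; the sketch below shows that idea with plain NumPy, assuming the GIST + color descriptors have already been computed and stacked into an array (feature extraction itself is not shown).

```python
import numpy as np

def transfer_captions(query_feat, db_feats, db_captions, k=5):
    """Return the captions of the k nearest database images.

    query_feat:  (d,) global descriptor of the query (e.g., GIST + color)
    db_feats:    (n, d) descriptors of the captioned photo collection
    db_captions: list of n caption strings
    """
    dists = np.linalg.norm(db_feats - query_feat[None, :], axis=1)
    nearest = np.argsort(dists)[:k]
    return [db_captions[i] for i in nearest]

# A reranking step (objects, stuff, scenes, TF-IDF) would then reorder
# these k candidates before transferring the final caption.
```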


Power laws

• A limited vocabulary appears an extremely large number of times; most words are rare (the long tail).
• Examples: frequency of tag words, content popularity.

[Figure: long-tailed frequency distribution; figure: Wikipedia]

Family of distributions of the form $f(x) = a x^{k}$.
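A quick way to check this on your own tag data is to fit a line to the log-log rank-frequency plot; the sketch below is a rough estimator of the exponent (a proper fit would use maximum likelihood, e.g., the powerlaw package), and the example tag list is made up.

```python
import collections

import numpy as np

def powerlaw_exponent(tags):
    """Rough estimate of the power-law exponent from tag frequencies.

    Fits a straight line to log(frequency) vs. log(rank); the magnitude of
    the slope approximates the exponent of the decaying tail. Crude, but a
    handy sanity check before plotting.
    """
    counts = sorted(collections.Counter(tags).values(), reverse=True)
    freqs = np.asarray(counts, dtype=float)
    ranks = np.arange(1, len(freqs) + 1, dtype=float)
    slope, _intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return -slope

# tags = ["sky", "sky", "sky", "canon", "vintage", "sky", "canon"]
# powerlaw_exponent(tags)  # roughly 1 for Zipf-like tag data
```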

Collecting data in the field

• Full control over the data
  – Sensor types (RGB-D, panorama)
  – Quality
• No copyright issues
• But: cost and scalability

Data collection summary

• Web, fieldwork, building on existing dataset

• Legal concerns • Quality issues

• Probably bigger is better

– Deep learning requires big data

ANNOTATING DATA

Annotation process

Input: raw data $D' = \{(x, z)\}$ with weak labels (search keywords, tags, GPS?)
Output: annotated data $D = \{(x, y)\}$ with clean labels (image labels, bounding boxes, pixel labels)

Annotation system

Designing annotation tools

Types of annotations:
• Category tag
• Bounding box (e.g., human pose)
• Segmentation (polygon, super-pixels)
• Natural language
• Attributes
• Tracking

Tools:
• HTML / JavaScript front end
• Web server to host images
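As a toy illustration of the "web server to host images" part, here is a minimal Flask sketch that serves images to an HTML/JavaScript front end and records submitted labels in memory; the routes, folder name, and JSON fields are made up for illustration, and a real tool would persist annotations and authenticate workers.

```python
from pathlib import Path

from flask import Flask, jsonify, request, send_from_directory

IMAGE_DIR = Path("images")   # hypothetical folder of images to annotate
app = Flask(__name__)
annotations = {}             # in-memory store: image filename -> label

@app.route("/images/<name>")
def serve_image(name):
    # Serve a raw image so the front end can display it to the worker.
    return send_from_directory(str(IMAGE_DIR), name)

@app.route("/tasks")
def list_tasks():
    # List images that still need an annotation.
    todo = [p.name for p in sorted(IMAGE_DIR.glob("*.jpg")) if p.name not in annotations]
    return jsonify(todo)

@app.route("/annotate", methods=["POST"])
def annotate():
    # The front end posts e.g. {"image": "0001.jpg", "label": "dog"}.
    data = request.get_json()
    annotations[data["image"]] = data["label"]
    return jsonify(ok=True)

if __name__ == "__main__":
    app.run(port=8000)
```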

Bounding boxes

Segmentation
• Pixel-wise labels
• Approximations: polygons, super-pixels
• e.g., LabelMe [Russell 2007]

Natural language

• Image-text pairs
• Decomposable with NLP techniques
  – Attribute: adjective + noun
  – Action: noun + verb
• Sentence generation

Example (UIUC Pascal Sentence [Rashtchian 2010]):
• One jet lands at an airport while another takes off next to it.
• Two airplanes parked in an airport.
• Two jets taxi past each other.
• Two parked jet airplanes facing opposite directions.
• Two passenger planes on a grassy plain.

Relative attributes [Parikh 2011]
• Annotate relative comparisons (e.g., more natural than, less smiling than) instead of absolute labels
• Is this natural? (Slide credit: Devi Parikh)

Object annotation in videos

Vatic [Vondrick 2012]

Choosing the right task

• The more difficult the task, the more expensive the annotation
  – (Worker) time is money
  – A very difficult task results in poor quality
• Decompose a very complicated task into multiple simple tasks
  – e.g., replace a single task to detect ALL objects with multiple tasks that each detect a specific object

Scaling up multi-label annotation

• Goal: efficiently label hundreds of categories (~1,000 binary questions per image?) [Deng 2014]
• For each image, each category (table, chair, horse, dog, cat, bird, ...) is marked present (+) or absent (-)

Hierarchy, sparsity, correlation [Deng 2014]

Naively asking 1,000 labels: "Is there a table?", "Is there a chair?", "Is there a horse?", "Is there a dog?", "Is there a cat?", "Is there a bird?", ...

Hierarchical questions: "Is there an animal?" -> "Is there a mammal?" -> "Is there a cat?". A "no" high in the hierarchy rules out whole groups (no table or chair, no bird, probably no horse), so far fewer questions are needed per image; a sketch of this pruning follows.
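A minimal sketch of that hierarchical pruning, assuming a hand-written label tree and an ask(label) callback that stands in for a crowdsourced yes/no question; the real system in [Deng 2014] additionally exploits label sparsity and correlation, which this toy version ignores.

```python
# Hypothetical label hierarchy; internal nodes are coarse questions,
# leaves are the actual labels we want.
HIERARCHY = {
    "animal": {
        "mammal": {"cat": {}, "dog": {}, "horse": {}},
        "bird": {},
    },
    "furniture": {"table": {}, "chair": {}},
}

def hierarchical_labels(ask, tree=HIERARCHY):
    """ask(label) -> bool, e.g. a crowd answer to "Is there a <label>?".

    Only descends into a subtree when the coarse question is answered yes,
    so a single "no" prunes all labels underneath it.
    """
    positives = []
    for label, children in tree.items():
        if not ask(label):
            continue
        if children:
            positives.extend(hierarchical_labels(ask, children))
        else:
            positives.append(label)
    return positives

# Example: an image containing only a cat answers yes to "animal", "mammal",
# and "cat"; the table and chair questions are never asked because
# "furniture" is answered no.
```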

Human-in-the-loop approach [Vijayanarasimhan 2011]

• Active learning: only annotate uncertain instances
  – Lower costs
  – Faster learning
• Loop: the model selects images from the unannotated pool, annotators label them, and the newly annotated images update the model; a sketch follows.
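One common selection rule is margin-based uncertainty sampling; the sketch below assumes a scikit-learn-style classifier with predict_proba and simply returns the indices of the least certain unlabeled images (the function and parameter names are illustrative, not taken from [Vijayanarasimhan 2011]).

```python
import numpy as np

def select_for_annotation(model, unlabeled_X, batch_size=10):
    """Pick the unlabeled samples the current model is least sure about.

    Uncertainty is measured by the margin between the top-2 predicted class
    probabilities; a small margin means the model cannot decide.
    """
    proba = model.predict_proba(unlabeled_X)      # shape (n_samples, n_classes)
    top2 = np.sort(proba, axis=1)[:, -2:]
    margin = top2[:, 1] - top2[:, 0]
    return np.argsort(margin)[:batch_size]        # indices to send to annotators
```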

Pitfall: asking people for validation

"Is this an airplane? Answer yes if a green rectangle is drawn around an airplane. Otherwise answer no."

• Machines are very unlikely to produce 100% correct detections
• Rule of thumb: ask workers to annotate the ground truth (rather than validate machine predictions)

Crowdsourcing market

• A requester posts tasks (image classification, object detection, segmentation, language description) to online workers and pays monetary rewards ($$$) for the results.
• Marketplaces: Amazon Mechanical Turk, CrowdFlower, etc.
• Non-US requesters probably need somebody in the US, or an agent, to use MTurk as of 2015.

Amazon MTurk demographics

• Country: 80% US, 20% India
• Gender: 50% male, 50% female
• Age: roughly 50% in their 30s, 20% in their 20s, 20% in their 40s

P. Ipeirotis, "Demographics of Mechanical Turk: Now Live! (April 2015 edition)", http://www.behind-the-enemy-lines.com

Workers are not the same (P. Welinder et al., The Multidimensional Wisdom of Crowds, NIPS 2010)

• One annotator = one classifier for "duck" presence (image credit: http://blogs.scientificamerican.com/guilty-planet/)
• Decision parameters estimated with a Bayesian model
• Groups 1, 2, 3 have different decision boundaries

Quality control

• There are always sloppy annotators: think of a bot randomly clicking on buttons
• Have a qualification test
• Insert JavaScript to validate answers (reject answers that are too short or too fast)
• Assign multiple annotators per task and aggregate their answers (see the sketch below)
• Control worker motivation: feedback, gamification
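A toy example of the "multiple annotators per task" idea: drop suspiciously fast responses and take a per-image majority vote. The tuple layout and the 2-second threshold are arbitrary assumptions; real pipelines often use worker-quality models such as Welinder et al. instead of a plain vote.

```python
import collections

def aggregate_labels(responses, min_seconds=2.0):
    """responses: iterable of (worker_id, image_id, label, seconds_spent).

    Filters out answers submitted faster than min_seconds (likely sloppy or
    automated), then returns the majority label per image.
    """
    votes = collections.defaultdict(collections.Counter)
    for worker_id, image_id, label, seconds in responses:
        if seconds < min_seconds:
            continue
        votes[image_id][label] += 1
    return {image_id: counter.most_common(1)[0][0]
            for image_id, counter in votes.items()}
```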

Qualification tests

• MTurk can be set up to allow only workers who pass a qualification test
  – Prepare test questions and gold-standard answers
  – Useful to assess, e.g., writing ability
• It is also possible to validate workers during the main tasks

Giving feedback to workers

• Feedback motivates workers; also include a comment form to get their opinions
• e.g., a progress bar (0-100%) with messages like "You're the rookie!", "You've annotated 20 images", "You have only 20 images left!"

Game-based annotation

• ESP Game [Ahn 2004] (image: many.corante.com)
• Pros: no MTurk needed, motivated players
• Cons: cheating, bias

ReferItGame [Kazemzadeh 2014]

Two-player game: Player 1 writes a referring expression for an object (e.g., "guy in front", "man in red shirt"); Player 2 clicks on the referred object in the image.

Finding experts on the Web: Quizz [Ipeirotis 2014]
• Free medical quizzes advertised via targeted ads
• Knowledgeable volunteers, no monetary rewards
• Much faster, with better quality
• Do monetary rewards harm quality?

Crowdsourcing considerations

• Know your workers

• Quality control – Fun tasks attract motivated workers!

Part I: dataset construction

• What is your dataset for? – Know your task

• Collecting data – on the Web or in the field; quality and big data

• Annotating data – Designing the right tool

• Crowdsourcing – Workers and quality control

Part II: Case studies

• Data-driven clothing parsing • Popularity analysis • Studying fashion styles • Studying fashion trends

DATA-DRIVEN CLOTHING PARSING

CVPR 2012, ICCV 2013

style.com

Clothing parsing

Pose and clothing: related work in semantic segmentation [Shotton 06] [Gould 09] [Liu 09] [Eigen 12] [Singh 13] [Tighe 10, 13, 14] [Dong 13] and pose estimation [Ferrari 08] [Bourdev 09] [Yang 11] [Dantone 13] [Ladicky 13]

Online fashion networks

Chictopia Lookbook Chicisimo Pinterest Tumblr ...

www.chictopia.com

Datasets

• Fashionista dataset: small, completely annotated images, for supervised learning
• Paper Doll dataset: large-scale tagged images, for a semi-supervised approach

Fashionista dataset
• 685 images with pose annotation and super-pixel labels
• Manually picked images from Chictopia
• Crowd-sourced annotation

[Figure 2 from the CVPR 2012 paper, "Clothing parsing pipeline": (a) parsing the image into superpixels [1]; (b) original pose estimation using the state-of-the-art flexible mixtures-of-parts model [27]; (c) precise clothing parse output by the proposed clothing estimation model (note the accurate labeling of items as small as the wearer's necklace, or as intricate as her open-toed shoes); (d) optional re-estimate of pose using clothing estimates (note the improvement in the left arm prediction, compared to the original incorrect estimate down along the side of her body).]

Excerpt from the CVPR 2012 paper:

Our main contributions include:
• A novel dataset for studying clothing parsing, consisting of 158,235 fashion photos with associated text annotations, and web-based tools for labeling.
• An effective model to recognize and precisely parse pictures of people into their constituent garments.
• Initial experiments on how clothing prediction might improve state-of-the-art models for pose estimation.
• A prototype visual garment retrieval application that can retrieve matches independent of pose.

Of course, clothing estimation is a very challenging problem. The number of garment types you might observe in a day on the catwalk of a New York city street is enormous. Add variations in pose, garment appearance, layering, and occlusion into the picture, and accurate clothing parsing becomes formidable. Therefore, we consider a somewhat restricted domain: fashion photos from Chictopia.com. These highly motivated users (fashionistas) upload individual snapshots (often full body) of their outfits to the website and usually provide some information related to the garments, style, or occasion for the outfit. This allows us to consider the clothing labeling problem in two scenarios: 1) a constrained labeling problem where we take the users' noisy and perhaps incomplete tags as the list of possible garment labels for parsing, and 2) where we consider all garment types in our collection as candidate labels.

1.1 Related Work

Clothing recognition: Though clothing items determine most of the surface appearance of the everyday human, there have been relatively few attempts at computational recognition of clothing. Early clothing parsing attempts focused on identifying layers of upper-body clothes in very limited situations [2]. Later work focused on grammatical representations of clothing using artists' sketches [6]. Freifeld and Black [13] represented clothing as a deformation from an underlying body contour, learned from training examples using principal component analysis to produce eigen-clothing. Most recently, attempts have been made to consider clothing items such as t-shirt or jeans as semantic attributes of a person, but only for a limited number of garments [4]. Different from these past approaches, we consider the problem of estimating a complete and precise region-based labeling of a person's outfit, for general images with a large number of potential garment types. Clothing items have also been used as implicit cues of identity in surveillance scenarios [26], to find people in an image collection of an event [11, 22, 25], to estimate occupation [23], or for robot manipulation [16]. Our proposed approach could be useful in all of these scenarios.

Pose estimation: Pose estimation is a popular and well-studied enterprise. Some previous approaches have considered pose estimation as a labeling problem, assigning most likely body parts to superpixels [18] or triangulated regions [20]. Current approaches often model the body as a collection of small parts and model relationships among them, using conditional random fields [19, 9, 15, 10] or discriminative models [8]. Recent work has extended patches to more general poselet representations [5, 3], or incorporated mixtures of parts [27] to obtain state-of-the-art results. Our pose estimation subgoal builds on this last method [27], extending the approach to incorporate clothing estimations in models for pose identification.

Image parsing: Image parsing has been studied as a step toward general image understanding [21, 12, 24].

Annotation tools: web-based annotation tools on Amazon Mechanical Turk

Lesson: segmentation is too hard for MTurk workers.

CRF-based parsing

[Figure 4: successful results on the Fashionista dataset; predicted labels include shoes, shirt, jeans, tights, jacket, dress, hat, heels, shorts, blouse, bracelet, wedges, top, stockings, hair, and skin.]

[Figure 5: failure cases: (a) skin-like color, (b) failing pose estimate, (c) spill into the background, (d) coarse pattern.]

Excerpt (CVPR 2012): As motivation for future research on clothing estimation, we also find that given true clothing labels our pose re-estimation system reaches a PCP of 89.5%, demonstrating the potential usefulness of incorporating clothing into pose identification.

4.4 Retrieving Visually Similar Garments: We build a prototype system to retrieve garment items via visual similarity in the Fashionista dataset. For each parsed garment item, we compute normalized histograms of RGB and L*a*b* color within the predicted labeled region.

Failure cases CVPR 12

[Duplicate screenshot: Figure 5, example successful results on the Fashionista dataset; Figure 6, example failure cases: (a) skin-like color, (b) failing pose estimate, (c) spill into the background, (d) coarse pattern.]

Excerpt (Sect. 4.4, continued): we measure similarity between items by Euclidean distance. For retrieval, we prepare a query image and obtain a list of images ordered by visual similarity. Figure 1 shows a few of the top retrieved results for images displaying shorts, blazer, and t-shirt (query in the leftmost column).

Paper Doll parsing: a retrieval-based approach

1. Get tagged images on the Web (the Paper Doll dataset)
2. Retrieve similar images (nearest neighbors) and collect their candidate tags
3. Use them to predict items in the query image (image parser)

Tagged images on the Web (e.g., tagged "dress", "hat", "heels", "sweater"): the Paper Doll dataset contains ~339,000 images.

Retrieval example: for a query image, the retrieved neighbors carry tags such as {bag, cardigan, heels, shorts, top}, {boots, skirt}, {flats, necklace, shirt, skirt}, {belt, pumps, skirt, t-shirt}, {belt, shirt, shoes, skirt, tights}, {skirt, top}; the pooled candidate tags for the query are dress, shoes, skirt, tights, belt, top.

Mixture of retrieval-based methods: from the input image, retrieve similar styles and candidate items; run global parsing, NN parsing, and transferred parsing; combine the predictions; and apply smoothing to obtain the final parse.

[Figure: qualitative results comparing input, ground truth, the CRF baseline, and Paper Doll parsing.]

Big data benefits: performance improves with retrieval data size (the CRF baseline does not use big data).

Big data benefits, qualitatively: the predicted parse of the same input changes noticeably between a retrieval set of 256 images and one of 262,144 images (predicted label sets: {accessories, bag, boots, dress, necklace, shoes, shorts, skirt, top} vs. {bag, boots, dress, heels, skirt, sunglasses, top}).

Data-driven clothing parsing

• Fashionista dataset – small – completely-annotated

• Paper Doll dataset

– large – user-annotated

POPULARITY ANALYSIS MM2014

Predicted most popular

Predicted least popular

Popularity prediction: regression analysis on 300K posts

• Input, content factors: tag TF-IDF, image composition, color entropy, style descriptor, parse descriptor
• Input, social factors: user identity, previous posts, node degrees
• Output: popularity (votes via the "like" button in Chictopia)
• The vote distribution has a long tail; is there a promotion effect?

Findings

• The outfit doesn't matter (!!!)
• Popularity is mostly an outcome of the social network
  – Social bias: #votes ∝ #followers
  – People just click on their friends' photos

Regression performance:

Factors           R²     Spearman  Accuracy (top 25%)  Accuracy (top 75%)
Social            0.491  0.682     0.847               0.779
Content           0.248  0.488     0.778               0.737
Social + Content  0.493  0.685     0.845               0.775

Social factors significantly boost the performance.

Rich-get-richer phenomena

• The popularity growth of linked content is proportional to its current popularity (Easley and Kleinberg 2010)

What if there is no social network?

• Popularity = f ( content factors )?

Crowdsourcing!
• Collect popularity votes on Amazon MTurk: 3,000 pictures, 25 assignments each
• No network involved

Out-of-network popularity: [histogram of #images vs. #votes]; no social factor in the voting process.

Task: predict the crowd (MTurk) popularity using content factors and/or social factors measured on Chictopia.

Predicting crowd votes:

Factors           R²     Spearman  Accuracy (top 25%)  Accuracy (top 75%)
Social            0.423  0.634     0.845               0.787
Content           0.428  0.647     0.888               0.862
Social + Content  0.473  0.686     0.884               0.858

• Content factors matter
• Social factors from Chictopia also predict crowd votes well
• User-content correlation: top bloggers consistently post good pictures

Lessons

• Crowdsourcing is not only for collecting ground truth, but also for studying human behavior
• Research opportunity for social visual media

STUDYING FASHION STYLES ECCV2014

Q: What makes the boy on the right look Harajuku-style?

Tie? Shoes?

tokyofashion.com

Goal

• Finding what constitutes a fashion style

• Approach – Game-based annotation – Attribute factorization

Goth

"Who's more Bohemian?" / "Other people think..."

hipsterwars.com: game-based collection of relative "style-ness"
• Seed images from initial keyword search on Google and fashion SNS
• Asked our online friends to participate: NO MONETARY REWARDS!

Participation statistics

Most participants played the game for only a few clicks; some motivated users clicked A LOT.

TrueSkill game algorithm [R. Herbrich, 2007]

• Algorithm to select which pair of images to show next
• Idea:
  – Represent each image's rating by a Gaussian
  – Update the Gaussian parameters after each click
  – Choose pairs that are expected to tie for the next play
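A rough sketch of the pair-selection idea, assuming each image's rating is kept as a Gaussian (mu, sigma): pick the pair whose predicted outcome is closest to a coin flip. The win-probability formula follows the standard TrueSkill form, but the full algorithm (including the updates of mu and sigma after each click) is in Herbrich et al., 2007, or the trueskill Python package.

```python
import itertools
import math

def win_probability(mu_a, sigma_a, mu_b, sigma_b, beta=4.0):
    """P(image A beats image B) under Gaussian rating estimates."""
    denom = math.sqrt(2 * beta ** 2 + sigma_a ** 2 + sigma_b ** 2)
    return 0.5 * (1.0 + math.erf((mu_a - mu_b) / (denom * math.sqrt(2))))

def pick_next_pair(ratings):
    """ratings: {image_id: (mu, sigma)}. Return the most informative pair,
    i.e., the one whose expected outcome is closest to a tie (p = 0.5)."""
    def distance_from_tie(pair):
        (_, (mu_a, sa)), (_, (mu_b, sb)) = pair
        return abs(win_probability(mu_a, sa, mu_b, sb) - 0.5)
    return min(itertools.combinations(ratings.items(), 2), key=distance_from_tie)

# ratings = {"img1": (25.0, 8.3), "img2": (27.5, 4.1), "img3": (20.0, 8.3)}
# pair = pick_next_pair(ratings)
```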

Score distribution after game

Most Hipster

Least Hipster

Annotation examples

[Figure: annotation examples ordered from most to least style score.]

Excerpt from the ECCV 2014 paper:

Fig. 5: Example results of the within-class classification task with δ = 0.5. Top and bottom predictions for each style category (Pinup, Goth, Hipster, Bohemian, Preppy) are shown.

5.2 Within-class classification

Our next style recognition task considers classification between top-rated and bottom-rated examples for each style independently. Here the goal is, for example, to determine whether a person is an uber-hipster or only sort of hipster. Again, we utilize linear SVMs [27], but here learn one visual model for each style in our dataset. Here δ determines the percentage of top- and bottom-ranked images used in the classification task. For example, δ = 0.1 means that we use the top-rated 10% of images from a style as positive samples and the bottom-rated 10% of samples from the same style as negative samples (using the ratings computed in Sec. 3.2). We evaluate experiments for δ ranging from 10% to 50%. We repeat the experiments for 100 random folds with a 9:1 train-to-test ratio. In each experiment, C is determined using 5-fold cross-validation. Results are reported in Figure 6. We observe that when δ is small we generally have better performance than for larger δ. This is because the classification task generally becomes more challenging as we add less extreme examples of each style. Additionally, we find best performance on the pinup category. Performance on the goth category comes in second. For the hipster category, we do quite well at differentiating between extremely strong or weak examples, but performance ...

High-quality dataset without Amazon MTurk

Relative vs. absolute

• We also asked MTurk workers for absolute 1-10 ratings
• The MTurk results were much noisier than the relative, game-based ratings

Analyzing what makes her look preppy

Factorization results

Fashion style analysis

• Game-based annotation collected high-quality data without monetary rewards

• How can we collect seed images?

STUDYING FASHION TRENDS WACV2015

Fashion trend: Runway to realway

Fashion show Street

Runway dataset ~35k images in 9k fashion shows over 15 years, from 2000 to 2014

[Chart: number of brands vs. number of photos per brand, on a log scale of roughly 10^1 to 10^3 photos.]

Collecting human judgments to learn similarity: the query image is given in the left column and five candidate images in the right columns. Instructions: 1) select the image with the most similar outfit to the query; 2) if there is NO similar image, select NONE.

Runway-to-runway retrieval Retrieving similar styles from other fashion shows

Runway-to-realway retrieval Retrieving similar styles from street snaps

Visually analyzing the floral trend: given a runway image of a floral outfit, retrieve timestamped street images and plot the percentage of retrieved images over time; the trend peaks in spring!

Part II: Case studies

• Vital roles of data
  – Data-driven clothing parsing: small complete annotations plus large-scale tags
  – Popularity analysis: verifying network phenomena using crowds
  – Studying fashion styles: game-based data collection
  – Studying fashion trends: learning human judgments to analyze trends

Effective dataset construction

[Chart: quality vs. scale, revisited: in-house datasets, crowd-sourced datasets, and raw, user-generated data, from ~100 to ~10B images.]

Crowd-sourced datasets, built on the wisdom of crowds, are a driving force of computer vision: recommended.