Source: vision.is.tohoku.ac.jp/.../mva2015-effective-dataset-construction.pdf

Effective Dataset Construction in Computer Vision

Kota Yamaguchi

Recent progress in image recognition

ILSVRC classification error, 2010 to 2014 [Russakovsky 2014]

• 2010: NEC 28.2%
• 2011: XRCE 25.8%
• 2012: SuperVision 16.4%
• 2013: Clarifai 11.7%
• 2014: GoogLeNet 6.7%
• Human 5.1%
• Ioffe et al. (arXiv) 4.9%

ILSVRC image classification task; example labels: Scale, T-shirt, Steel drum, Drumstick, Mud turtle

Deep models and ...

Int J Comput Vis

Fig. 1 The best reported performance on PASCAL VOC challenge has shown marked increases since 2006 (top). This could be due to various factors: the dataset itself has evolved over time, the best-performing methods differ across years, etc. In the bottom-row, we plot a particular factor, training data size, which appears to correlate well with performance. This begs the question: has the increase been largely driven from the availability of larger training sets?

Fig. 2 We plot idealized curves of performance versus training dataset size and model complexity. The effect of additional training examples is diminished as the training dataset grows (left), while we expect performance to grow with model complexity up to a point, after which an overly-flexible model overfits the training dataset (right). Both these notions can be made precise with learning theory bounds, see e.g. (McAllester 1999)

1.1 Challenges

We found there is a surprising amount of subtlety in scaling up training data sets in current systems. For a fixed model, one would expect performance to generally increase with the amount of data and eventually saturate (Fig. 2). Empirically, we often saw the bizarre result that off-the-shelf implementations show decreased performance with additional data! One would also expect that to take advantage of additional training data, it is necessary to grow the model complexity, in this case by adding mixture components to capture different object sub-categories and viewpoints. However, even with non-parametric models that grow with the amount of training data, we quickly encountered diminishing returns in performance with only modest amounts of training data.

We show that the apparent performance ceiling is not a consequence of HOG+linear classifiers. We provide an analysis of the popular deformable part model (DPM), showing that it can be viewed as an efficient way to implicitly encode and score an exponentially-large set of rigid mixture components with shared parameters. With the appropriate sharing, DPMs produce substantial performance gains over standard non-parametric mixture models. However, DPMs have fixed complexity and still saturate in performance with current amounts of training data, even when scaled to mixtures of DPMs. This difficulty is further exacerbated by the computational demands of non-parametric mixture models, which can be impractical for many applications.

1.2 Proposed Solutions

In this paper, we offer explanations and solutions for many of these difficulties. First, we found it crucial to set model regularization as a function of training dataset using cross-validation, a standard technique which is often overlooked in current object detection systems. Second, existing strategies for discovering sub-category structure, such as clustering aspect ratios (Felzenszwalb et al. 2010), appearance features (Divvala et al. 2012), and keypoint labels (Bourdev and Malik 2009) may not suffice. We found this was related to the inability of classifiers to deal with “polluted” data when mixture labels were improperly assigned. Increasing model complexity is thus only useful when mixture components capture the “right” sub-category structure.
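The first point, choosing the regularization strength by cross-validation instead of fixing it, can be sketched as follows. This is only a minimal illustration and not the detector training used in the paper: it assumes a one-dimensional ridge regression with a closed-form weight, and the names `ridge_fit`, `cv_error`, `select_lambda` and the candidate grid are hypothetical.

```python
import random


def ridge_fit(xs, ys, lam):
    """Closed-form 1-D ridge regression without intercept.

    w = sum(x * y) / (sum(x * x) + lam)
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def cv_error(xs, ys, lam, k=5):
    """Mean squared validation error over k interleaved folds."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        # Train on every sample whose index does not fall in this fold.
        train_x = [x for i, x in enumerate(xs) if i % k != fold]
        train_y = [y for i, y in enumerate(ys) if i % k != fold]
        w = ridge_fit(train_x, train_y, lam)
        # Validate on the held-out samples of this fold.
        total += sum((ys[i] - w * xs[i]) ** 2 for i in range(n) if i % k == fold)
    return total / n


def select_lambda(xs, ys, candidates=(0.01, 0.1, 1.0, 10.0, 100.0), k=5):
    """Return the candidate regularization strength with the lowest CV error."""
    return min(candidates, key=lambda lam: cv_error(xs, ys, lam, k))


if __name__ == "__main__":
    rng = random.Random(0)
    # Re-select the regularization strength for each training-set size.
    for n in (10, 100, 1000):
        xs = [rng.uniform(-1, 1) for _ in range(n)]
        ys = [2.0 * x + rng.gauss(0, 1.0) for x in xs]
        print(n, select_lambda(xs, ys))
```

The design point is that `select_lambda` is re-run whenever the training set changes, so the regularization is tuned per dataset size rather than carried over from a smaller experiment.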

To efficiently take advantage of additional training data, we introduce a non-parametric extension of a DPM which we call an exemplar deformable part model (EDPM). Notably, EDPMs increase the expressive power of DPMs with only a negligible increase in computation, making them practically useful. We provide evidence that suggests that compositional representations of mixture templates provide an effective way to help target the “long-tail” of object appearances by sharing local part appearance parameters across templates.

Extrapolating beyond our experiments, we see the striking difference between classic mixture models and the non-parametric compositional model (both mixtures of linear classifiers operating on the same feature space) as evidence that the greatest gains in the near future will not be had with simple models + bigger data, but rather through improved representations and learning algorithms.

We introduce our large-scale dataset in Sect. 2, describe our non-parametric mixture models in Sect. 3, present extensive experimental results in Sect. 4, and conclude with a discussion in Sect. 5 including related work.

X Zhu et al. Do We Need More Training Data? IJCV 2015

PASCAL VOC best performance

Data improvement? Model improvement?

We need both

Data drive statistical models

• Training: Training data → Model (e.g., CNN)

• Testing: Testing data + Model → Results

Object recognition datasets

2014

• ImageNet • Stanford Background

• SUN

2009 2004

• Caltech 101 • MSRC • ESP Game

• MS COCO • YFCC 100M

• TinyImage

WordNet (1985-)

• PASCAL VOC • Caltech 256 • LabelMe

• SBU1M

• UIUC

ImageNet

• WordNet Hierarchy

• 14M images, 21K synsets as of Apr 2015

• Used in ILSVRC

[Deng 2009]

Microsoft COCO

• Over 300K images, 2M instances

• Creative Commons

• Segmentation

• 5 captions / image

[Lin 2014]

Quality vs. scale

Figure: datasets placed by Quality against Scale (scale axis ticks: 100, 1K, 10K, 100K, 1M, 10M, 100M, 1B, 10B)

• In-house datasets

• Crowd-sourced datasets – MS COCO – ImageNet – SUN – Caltech 101

• Raw, user-generated data – SBU1M – YFCC100M

Motivation

• Data is driving image recognition, but creating a good dataset is not easy

• Big-data challenges – Scalability – Quality

• How should we construct a dataset?

Agenda

Part I: dataset construction • Dataset construction • Collecting data • Annotating data • Crowdsourcing

• 10-min break

Part II: case studies • Data-driven clothing parsing • Popularity analysis • Studying fashion styles • Studying fashion trends

Dataset construction

1. Decide the task – Classification, Detection, Segmentation, ...

2. Collecting data – Web, Fieldwork

3. Select and annotate data – Crowdsourcing

4. Your dataset is ready for use

To annotate, or not to annotate?

• If your task is ... – Supervised approach – Benchmarking

Need a lot of annotations

• If your task is ... – Data mining – Weakly supervised approach

Need only minor annotations

Supervised scenario: classification

• Image-label pairs – Every picture must be completely annotated – Very clean

$D = \{(x, y)\}$, $y \in \{\text{bird}, \text{cat}, \text{dog}, \ldots\}$

Figure: CIFAR-10 dataset [Krizhevsky 2009]

Weakly-supervised scenario: tag prediction

• Image and user-attached tags – No annotation effort – Often missing tags – Not necessarily visual

$D = \{(x, z)\}$, $z \in \{\text{canon}, \text{USA}, \text{vintage}, \ldots\}$

Boulder, Colorado, city, historic, history, America, United States, urban, street, vintage, historical, ephemeral, classic, retro, brick, sign, signage, tavern, restaurant, cafe, dining, building, nostalgic, nostalgia, old, wall, door, window, Canon, architecture, Southwest

https://www.flickr.com/photos/29069717@N02/16772466913/

Dataset construction

• The purpose of a dataset differs greatly depending on the goal

• Be ambitious! – ImageNet [Deng et al, 2009]: from WordNet to visual ontology – Visipedia [Perona et al, 2010]: constructing a visual encyclopedia

COLLECTING VISUAL DATA

Collecting data

• Web • Fieldwork

Collecting data on the Web

• Approaches – Search Engine: Google, Bing – Web API: Flickr, SNS – Web scraping

• Considerations – Legal issues – Noise and distribution of online content – Storage

Keyword-based search

• Good for weakly-categorized images

• Issues – Data-size limitation – Variability

Example query: whippet

Query expansion [Deng 2009]

• Original query: whippet

• Synonyms: whippet dog, whippet greyhound

• Translations: whippet lebrel, 惠比特犬
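The expansion on this slide (appending related words to the keyword and adding translated queries) can be sketched in a few lines. This is a minimal illustration, not the ImageNet pipeline itself; the function name `expand_queries` and its parameters are hypothetical, and the related words and translations are supplied by hand rather than looked up in WordNet.

```python
def expand_queries(keyword, related_words=(), translations=()):
    """Return the original keyword plus expanded variants, without duplicates."""
    queries = [keyword]
    # Appending a related word steers the search toward the intended sense.
    queries += [f"{keyword} {word}" for word in related_words]
    # Translated queries reach pages written in other languages.
    queries += list(translations)

    seen = set()
    unique = []
    for query in queries:
        if query not in seen:
            seen.add(query)
            unique.append(query)
    return unique


if __name__ == "__main__":
    for query in expand_queries(
        "whippet",
        related_words=["dog", "greyhound"],
        translations=["whippet lebrel", "惠比特犬"],
    ):
        print(query)
```

Each returned query would be sent to the search engine separately and the result sets merged, which eases the data-size limit of a single keyword search.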

Web API

• Web services sometimes provide a developer API

• Structured data – e.g., JSON, XML

www.flickr.com/services/developer
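Structured responses are easy to turn into dataset records. The sketch below parses a JSON body and keeps only entries with an allowed license. The response body, its field names (`photos`, `id`, `url`, `tags`, `license`), and the function `parse_photos` are all hypothetical and do not reproduce the schema of Flickr or any other service.

```python
import json

# Hypothetical response body; field names are illustrative only.
SAMPLE_RESPONSE = """
{
  "photos": [
    {"id": "1", "url": "https://example.com/1.jpg",
     "tags": ["whippet", "dog"], "license": "cc-by"},
    {"id": "2", "url": "https://example.com/2.jpg",
     "tags": [], "license": "all-rights-reserved"}
  ]
}
"""


def parse_photos(body, allowed_licenses=("cc-by",)):
    """Extract id, url, and tags for photos whose license is allowed."""
    data = json.loads(body)
    records = []
    for photo in data.get("photos", []):
        if photo.get("license") in allowed_licenses:
            records.append(
                {
                    "id": photo["id"],
                    "url": photo["url"],
                    "tags": photo.get("tags", []),
                }
            )
    return records


if __name__ == "__main__":
    for record in parse_photos(SAMPLE_RESPONSE):
        print(record)
```

Filtering on the license field at parse time keeps unusable content out of the dataset from the start.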

Web scraper

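Components and data flow:

• URL Queue: holds the URLs still to be fetched

• HTTP Client: takes URLs from the queue and requests pages from the WWW

• Parser: extracts data and newly found URLs from each page

• Storage: receives the extracted data; new URLs go back to the URL Queue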
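The scraper components on this slide (URL queue, HTTP client, parser, storage) map onto a short breadth-first loop. The sketch below is an assumption-laden illustration: the HTTP client is passed in as a `fetch` function so the example runs offline against a fake site, and the names `LinkAndImageParser` and `crawl` are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndImageParser(HTMLParser):
    """Parser stage: collects hyperlinks (new URLs) and image sources (data)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))


def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl; `fetch(url)` plays the HTTP client and returns HTML."""
    queue = deque(seed_urls)  # URL queue
    visited = set()
    storage = []  # storage stage: collected image URLs
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        parser = LinkAndImageParser(url)
        parser.feed(fetch(url))
        storage.extend(parser.images)
        # Newly found URLs go back into the queue.
        queue.extend(link for link in parser.links if link not in visited)
    return storage


if __name__ == "__main__":
    fake_site = {
        "http://example.com/": '<a href="/a">a</a><img src="/1.jpg">',
        "http://example.com/a": '<a href="/">home</a><img src="/2.jpg">',
    }
    print(crawl(["http://example.com/"], lambda url: fake_site.get(url, "")))
```

In a real crawl, `fetch` would wrap an HTTP library call and add rate limiting; injecting it keeps the queue and parser logic testable without network access.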

Legal issues

• CAREFULLY READ TERMS – Service providers don’t like bad access, especially if it harms their business (e.g., copying the entire website) – Users own copyright on their own content – Talk to an expert if unsure

• Recommendation – Creative Commons (Flickr) – Citing URL instead of redistributing data

YFCC100M Yahoo Flickr Creative Commons 100M

http://yahoolabs.tumblr.com/post/89783581601/one-hundred-million-creative-commons-flickr-images

[Thomee et al, 2015]

• 100M Flickr Photos

• 49M geo-tagged • Image URLs • Title and Description • Tags

Could be used as a basis for building a new dataset

Quality of online data

Flickr description != caption

Vacation on the water One week vacation in the blue waters of Turkey was one of the best weeks in my life. On day in each bay just worrying about the sun and the water. One week without putting on shoes or using the phone. Paradise on earth!

https://www.flickr.com/photos/rspedro/8396863230/

Learning from online content

• User-generated content does not contain clean data – Non-visual texts / tags – Tags tend to have high precision, low recall – Frequency issue

• Hopefully, large data-size resolves issues
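The "high precision, low recall" remark can be made concrete by comparing user tags against the concepts actually visible in a photo. The sketch below is a minimal illustration; the function name `tag_precision_recall` and the example annotation are hypothetical, not measurements from any dataset.

```python
def tag_precision_recall(user_tags, visible_concepts):
    """Precision and recall of user tags against the concepts visible in the image."""
    user_tags = set(user_tags)
    visible_concepts = set(visible_concepts)
    correct = user_tags & visible_concepts
    # Precision: fraction of the given tags that are really in the image.
    precision = len(correct) / len(user_tags) if user_tags else 0.0
    # Recall: fraction of the visible concepts that were tagged at all.
    recall = len(correct) / len(visible_concepts) if visible_concepts else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical annotation: both tags are correct, but six visible things are untagged.
    tags = ["building", "sign"]
    visible = ["building", "sign", "street", "window", "door", "brick", "wall", "car"]
    print(tag_precision_recall(tags, visible))
```

A missing tag therefore says little about whether the concept is absent, which is why user tags are poorly suited as negative labels.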

Bigger data help – Retrieval from SBU1M dataset

[Ordonez 2011]

Im2Text: Describing Images Using 1 Million Captioned Photographs

Vicente Ordonez, Girish Kulkarni, Tamara L. Berg

Stony Brook University

Our Goal

Figure (Computer Vision), image labels: sky, trees, water, building, bridge. Captions shown:

• One of the many stone bridges in town that carry the gravel carriage roads.
• An old bridge over dirty green water.
• A stone bridge over a peaceful river.

Contributions

• SBU Captioned Photo Dataset: A large novel data set containing 1 million images from the web with associated captions written by people, filtered so that the descriptions are likely to refer to visual content. [http://tamaraberg.com/sbucaptions]

• A description generation method that utilizes global image representations to retrieve and transfer captions from our data set to a query image.

• A description generation method that utilizes both global representations and direct estimates of image content (objects, actions, stuff, attributes, and scenes) to produce relevant image descriptions.

Method overview

• Matching using Global Image Features (GIST + Color) against the SBU Captioned Photo Dataset (1 million captioned images!), then Transfer Caption(s), e.g. “The water is clear enough to see fish swimming around in it.”

• Rerank retrieved images using high level content (captions, object detections, scene classification, stuff detections, people & actions), then Transfer Caption(s), e.g. “The bridge over the lake on Suzhou Street.”

Captions shown with the global-matching panel:

• Smallest house in paris between red (on right) and beige (on left).
• Bridge to temple in Hoan Kiem lake.
• The water is clear enough to see fish swimming around in it.
• A walk around the lake near our house with Abby.
• Hangzhou bridge in West lake.
• The daintree river by boat.

Captions shown with the reranking panel:

• The bridge over the lake on Suzhou Street.
• The Daintree river by boat.
• Bridge over Cacapon river.
• Iron bridge over the Duck river.

High level information

• Objects: 80 object categories using part-based deformable models and compute distances with objects detected in the query image based on visual attributes and raw visual descriptors.

• Stuff: Detect stuff regions using a sliding window SVM scoring function with texton, color and geometric features as input. We determine similarity with the query image using product of SVM probabilities. (water, etc)

• People/Actions: Detect people and pose using state-of-the-art methods and compute person similarity using an attribute based representation of pose.

• Scenes: Train classifiers using global features for 26 common scene types and use the vector of classifier responses as a feature to compute similarity between images.

• TFIDF: Rank the words in the returned set of image captions using their term-frequency inverse document frequency scores and follow a similar approach with the keywords for each object detection in the matching image set. As a result we obtain text-based TFIDF scores and object-detection-based TFIDF scores.

Dataset size

Past work on image retrieval has shown that small collections often produce spurious matches. Increasing data set size has a significant effect on the quality of retrieved global matches. Quantitative results also reflect this (see table at the bottom)

Figure: matches for a query image at collection sizes 1,000, 10,000, 100,000, and 1,000,000. Captions shown:

• Our dog Zoe in her bed.
• Interior design of modern white and brown living room furniture against white wall with a lamp hanging.
• The Egyptian cat statue by the floor clock and perpetual motion machine in the pantheon.
• Man sits in a rusted car buried in the sand on Waitarere beach.
• Emma in her hat looking super cute.
• Little girl and her dog in northern Thailand. They both seemed interested in what we were doing.

BLEU score evaluation

Method                                           BLEU score
Global matching (1k)                             0.0774 ± 0.0059
Global matching (10k)                            0.0909 ± 0.0070
Global matching (100k)                           0.0917 ± 0.0101
Global matching (1 million)                      0.1177 ± 0.0099
Global + Content matching (linear regression)    0.1215 ± 0.0071
Global + Content matching (linear SVM)           0.1259 ± 0.0060

Human evaluation

In addition, we propose a new evaluation task where a user is presented with two photographs and one caption. The user must assign the caption to the most relevant image. For evaluation we use a query image, a random image and a generated caption.

Task prompt: “Please choose the image that better corresponds to the given caption:” Example caption: “The view from the 13th floor of an apartment building in Nakano awesome.”

Caption used                     Success rate
Original human caption           96.0%
Top caption                      66.7%
Best from our top 4 captions     92.7%

Good results (captions as extracted; images not recoverable)

• Amazing colours in the sky at sunset with the orange of the cloud and the blue of the sky behind.
• Strange cloud formation literally flowing through the sky like a river in relation to the other clouds out there.
• Fresh fruit and vegetables at the market in Port Louis Mauritius.
• Clock tower against the sky.
• Tree with red leaves in the field in autumn.
• One monkey on the tree in the Ourika Valley Morocco
• A female mallard duck in the lake at Luukki Espoo
• The sun was coming through the trees while I was sitting in my chair by the river

Bad results: Incorrect objects, Incorrect context, Completely wrong (captions extracted next to this panel; images not recoverable)

• Kentucky cows in a field.
• The cat in the window.
• The sky is blue over the Gherkin.
• The boat ended up a kilometre from the water in the middle of the airstrip.
• Tree beside the river.
• Water over the road.

SBU Captioned Photo Dataset (sample captions)

• Under the sky of burning clouds.
• Stained glass window in Eusebius church.
• From the big tower on the hill over looking central Wakkanai.
• Reflection of the clear blue sky in the water.
• An old roman wall by the tower of London.
• A tree right around the corner from our house is this tree after the snow fell it was so beautiful.
• Young baboon in the campsite at fish river canyon.
• Not quite sure what the name of this bird is. Saw while walking along the beach in Ocracoke, NC
• Granite in green glass.
• This is a boat I saw while walking near the house we rented.
• Graffiti water tower in Sidney, Ohio.
• Looks like this might have been RCA building you can still see the RCA dog in the stained glass window
• The old Premium Oil Co. sign in Green River, Utah.
• Chimaki black floral cosmetic bag with handle.
• The water tower in downtown Campbell.
• The famous Liver bird atop the Royal Liverpool Insurance building near the newly tarted up docks.
• Evan playing in the sand on a calais beach in France.
• This is the old water tower at the Goodyear plant in Cartersville, Georgia.
• My house...yeah right. This was the beach house we stayed in with my family for vacation, in the Outer Banks.
• Old cone water tower with Graffiti in Detroit Michigan
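The global-matching baseline on this poster (retrieve the nearest captioned images and transfer their captions) reduces to a nearest-neighbor lookup. The sketch below assumes each image is already described by a feature vector; the two-dimensional toy features stand in for the GIST + color descriptors, and the names `euclidean` and `transfer_captions` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def transfer_captions(query_feature, database, k=1):
    """Return the captions of the k database images closest to the query.

    `database` is a list of (feature_vector, caption) pairs.
    """
    ranked = sorted(database, key=lambda item: euclidean(query_feature, item[0]))
    return [caption for _, caption in ranked[:k]]


if __name__ == "__main__":
    # Toy features; real descriptors would have hundreds of dimensions.
    database = [
        ((0.9, 0.1), "The bridge over the lake on Suzhou Street."),
        ((0.8, 0.2), "Iron bridge over the Duck river."),
        ((0.1, 0.9), "Our dog Zoe in her bed."),
    ]
    print(transfer_captions((0.85, 0.12), database, k=2))
```

With a larger database the nearest neighbor tends to lie closer to the query, which is the effect the BLEU table reports as the collection grows from 1k to 1 million images.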

Im2Text: Describing Images Using 1 Million

Captioned Photographs Vicente Ordonez, Girish Kulkarni, Tamara L. Berg

Stony Brook University

Method overview

Contributions

Method BLEU score

Global matching (1k) 0.0774 +- 0.0059

Global matching (10k) 0.0909 +- 0.0070

Global matching (100k) 0.0917 +- 0.0101

Global matching (1million) 0.1177 +- 0.0099

Global + Content matching (linear regression) 0.1215 +- 0.0071

Global + Content matching (linear SVM) 0.1259 +- 0.0060

High level information

BLEU score evaluation

• SBU Captioned Photo Dataset: A large novel data set containing 1 million

images from the web with associated captions written by people, filtered so that

the descriptions are likely to refer to visual content.

[http://tamaraberg.com/sbucaptions]

• A description generation method that utilizes global image representations to

retrieve and transfer captions from our data set to a query image.

• A description generation method that utilizes both global representations and

direct estimates of image content (objects, actions, stuff, attributes, and

scenes) to produce relevant image descriptions.

Dataset size

Good results

Amazing colours in

the sky at sunset with

the orange of the

cloud and the blue of

the sky behind.

Strange cloud formation literally flowing

through the sky like a river in relation to

the other clouds out there.

Fresh fruit and vegetables

at the market in Port

Louis Mauritius.

Clock tower

against the sky. Tree with red leaves in the field in

autumn.

One monkey on the tree in the

Ourika Valley Morocco A female mallard duck in the lake

at Luukki Espoo

The sun was

coming through

the trees while I

was sitting in my

chair by the river

Query Image

1,000 10,000 100,000 1,000,000

Our dog Zoe in her bed.

Interior design of modern white

and brown living room furniture

against white wall with a lamp

hanging.

The Egyptian cat statue by the

floor clock and perpetual motion

machine in the pantheon.

Man sits in a rusted car buried in

the sand on Waitarere beach.

Emma in her hat looking super

cute.

t i d ig f d hitLittle girl and her dog in

northern Thailand. They both

seemed interested in what

we were doing.

Past work on image

retrieval has shown that

small collections often

produce spurious

matches. Increasing data

set size has a significant

effect on the quality of

retrieved global matches.

Quantitative results also

reflect this (see table at

the bottom)

Kentucky cows in a field.

The cat in the window.

The sky is blue over the Gherkin. The boat ended up a kilometre

from the water in the middle of

the airstrip.

Tree beside the river. Water over the road.

Bad results Incorrect objects Incorrect context

Completely wrong

Human evaluation

Objects: 80 object categories using part-based deformable models and

compute distances with objects detected in the query image based on visual

attributes and raw visual descriptors.

Stuff: Detect stuff regions using a sliding window SVM scoring function with texton, color

and geometric features as input. We determine similarity with the query image using

product of SVM probabilities. (water, etc)

People/Actions: Detect people and pose using state-of-the-art methods and

compute person similarity using an attribute based representation of pose.

Scenes: Train classifiers using global features for 26 common scene types and use the

vector of classifier responses as a feature to compute similarity between images.

TFIDF: Rank the words in the returned set of image captions using their term-frequency inverse document frequency scores and follow a similar approach with the

keywords for each object detection in the matching image set. As a result we obtain text-based TFIDF scores and object-detection-based TFIDF scores.

Caption used Success rate

Original human caption 96.0%

Top caption 66.7%

Best from our top 4 captions 92.7%

SBU Captioned Photo Dataset

Under the sky of burning clouds. Stained glass

window in

Eusebius church.

From the big tower on the hill over

looking central Wakkanai.

Reflection of the clear blue sky in

the water.

An old roman wall by the

tower of London. A tree right around

the corner from our

house is this tree

after the snow fell it

was so beautiful.

Young baboon in

the campsite at

fish river canyon.

Not quite sure what the

name of this bird is. Saw

while walking along the

beach in Ocracoke, NC

Granite in green

glass. This is a boat I saw

while walking near the

house we rented.

Graffiti water

tower in Sidney,

Ohio.

Looks like this might have

been RCA building you can

still see the RCA dog in the

stained glass window

The old Premium

Oil Co. sign in

Green River, Utah.

Chimaki black floral

cosmetic bag with

handle.

Graffiti water

tower in Sidney,

Ohio.

The water tower

in downtown

Campbell.

The famous Liver bird atop

the Royal Liverpool

Insurance building near the

newly tarted up docks.

Evan playing in

the sand on a

calais beach in

France.

This is the old water

tower at the

Goodyear plant in

Cartersville, Georgia.

My house...yeah right. This

was the beach house we

stayed in with my family for

vacation, in the Outer Banks.

The water tower

in downtown

Campbell.

Old cone water

tower with Graffiti

in Detroit

Michigan

Matching using Global

Image Features

(GIST + Color)

Smallest house in paris

between red (on right)

and beige (on left).

Bridge to temple in

Hoan Kiem lake.

The water is

clear enough to

see fish

swimming

around in it.

A walk around the

lake near our house

with Abby.

Hangzhou bridge in

West lake.

The daintree river by

boat.

. . .

SBU Captioned Photo Dataset

Transfer Caption(s)

e.g. “The water is clear

enough to see fish

swimming around in it.”

1 million captioned images!

The bridge over the

lake on Suzhou Street.

The Daintree river by boat. Bridge over Cacapon river.

Iron bridge over the Duck river.

. . .

Transfer Caption(s)

e.g. “The bridge over the lake

on Suzhou Street.”

Rerank retrieved images using high level content (captions, object detections,

scene classification, stuff detections, people & actions)

sky

trees

water

building

bridge

One of the many stone bridges in town

that carry the gravel carriage roads.

An old bridge over dirty green water.

A stone bridge over a peaceful river.

Computer Vision

Our Goal

The view from the 13th floor of an

apartment building in Nakano awesome.

Please choose the

image that better

corresponds to the

given caption:

In addition, we propose a new evaluation task where a user is presented with two photographs

and one caption. The user must assign the caption to the most relevant image. For evaluation we

use a query image, a random image and a generated caption.

Im2Text: Describing Images Using 1 Million

Captioned Photographs Vicente Ordonez, Girish Kulkarni, Tamara L. Berg

Stony Brook University

Method overview

Contributions

Method BLEU score

Global matching (1k) 0.0774 +- 0.0059

Global matching (10k) 0.0909 +- 0.0070

Global matching (100k) 0.0917 +- 0.0101

Global matching (1million) 0.1177 +- 0.0099

Global + Content matching (linear regression) 0.1215 +- 0.0071

Global + Content matching (linear SVM) 0.1259 +- 0.0060

High level information

BLEU score evaluation

• SBU Captioned Photo Dataset: A large novel data set containing 1 million

images from the web with associated captions written by people, filtered so that

the descriptions are likely to refer to visual content.

[http://tamaraberg.com/sbucaptions]

• A description generation method that utilizes global image representations to

retrieve and transfer captions from our data set to a query image.

• A description generation method that utilizes both global representations and

direct estimates of image content (objects, actions, stuff, attributes, and

scenes) to produce relevant image descriptions.

Dataset size

Good results

Amazing colours in

the sky at sunset with

the orange of the

cloud and the blue of

the sky behind.

Strange cloud formation literally flowing

through the sky like a river in relation to

the other clouds out there.

Fresh fruit and vegetables

at the market in Port

Louis Mauritius.

Clock tower

against the sky. Tree with red leaves in the field in

autumn.

One monkey on the tree in the

Ourika Valley Morocco A female mallard duck in the lake

at Luukki Espoo

The sun was

coming through

the trees while I

was sitting in my

chair by the river

Query Image

1,000 10,000 100,000 1,000,000

Our dog Zoe in her bed.

Interior design of modern white

and brown living room furniture

against white wall with a lamp

hanging.

The Egyptian cat statue by the

floor clock and perpetual motion

machine in the pantheon.

Man sits in a rusted car buried in

the sand on Waitarere beach.

Emma in her hat looking super

cute.

t i d ig f d hitLittle girl and her dog in

northern Thailand. They both

seemed interested in what

we were doing.

Past work on image

retrieval has shown that

small collections often

produce spurious

matches. Increasing data

set size has a significant

effect on the quality of

retrieved global matches.

Quantitative results also

reflect this (see table at

the bottom)

Kentucky cows in a field.

The cat in the window.

The sky is blue over the Gherkin. The boat ended up a kilometre

from the water in the middle of

the airstrip.

Tree beside the river. Water over the road.

Bad results Incorrect objects Incorrect context

Completely wrong

Human evaluation

Objects: Detect 80 object categories using part-based deformable models, and compute distances to the objects detected in the query image based on visual attributes and raw visual descriptors.

Stuff: Detect stuff regions (water, etc.) using a sliding-window SVM scoring function with texton, color and geometric features as input. Similarity with the query image is the product of SVM probabilities.

People/Actions: Detect people and pose using state-of-the-art methods, and compute person similarity using an attribute-based representation of pose.

Scenes: Train classifiers using global features for 26 common scene types, and use the vector of classifier responses as a feature to compute similarity between images.

TFIDF: Rank the words in the returned set of image captions by their term-frequency inverse-document-frequency (TF-IDF) scores, and follow a similar approach with the keywords for each object detection in the matching image set. This yields both text-based TFIDF scores and object-detection-based TFIDF scores.
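The TFIDF ranking of caption words can be sketched as follows. This is only an illustration, not the poster's implementation: the tokenizer, the smoothed IDF formula, and the function name `rank_words_by_tfidf` are assumptions, and the captions are reused from the examples on this poster.

```python
import math
from collections import Counter

def rank_words_by_tfidf(retrieved_captions, background_captions):
    """Rank words in the retrieved captions by TF-IDF.

    TF is counted over the retrieved set; IDF is estimated from a
    background caption collection (both are plain lists of strings).
    """
    def tokenize(text):
        words = (w.strip(".,!?\"'").lower() for w in text.split())
        return [w for w in words if w]

    tf = Counter(w for cap in retrieved_captions for w in tokenize(cap))
    n_docs = len(background_captions)
    df = Counter()
    for cap in background_captions:
        df.update(set(tokenize(cap)))

    scores = {}
    for word, count in tf.items():
        # Smoothed IDF (an assumed variant) so unseen words do not divide by zero.
        idf = math.log((1 + n_docs) / (1 + df[word])) + 1.0
        scores[word] = count * idf
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

retrieved = [
    "Bridge over Cacapon river.",
    "Iron bridge over the Duck river.",
    "The bridge over the lake on Suzhou Street.",
]
background = retrieved + [
    "The cat in the window.",
    "The sky is blue over the Gherkin.",
    "Tree beside the river.",
    "Water over the road.",
]
ranking = rank_words_by_tfidf(retrieved, background)
print(ranking[:3])
```

Frequent but uninformative words such as "the" and "over" are pushed down because they occur in most background captions, while "bridge" rises to the top.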

Caption used                    Success rate
Original human caption          96.0%
Top caption                     66.7%
Best from our top 4 captions    92.7%

SBU Captioned Photo Dataset

Under the sky of burning clouds.

Stained glass window in Eusebius church.

From the big tower on the hill over looking central Wakkanai.

Reflection of the clear blue sky in the water.

An old roman wall by the tower of London.

A tree right around the corner from our house is this tree after the snow fell it was so beautiful.

Young baboon in the campsite at fish river canyon.

Not quite sure what the name of this bird is. Saw while walking along the beach in Ocracoke, NC

Granite in green glass.

This is a boat I saw while walking near the house we rented.

Graffiti water tower in Sidney, Ohio.

Looks like this might have been RCA building you can still see the RCA dog in the stained glass window

The old Premium Oil Co. sign in Green River, Utah.

Chimaki black floral cosmetic bag with handle.

Graffiti water tower in Sidney, Ohio.

The water tower in downtown Campbell.

The famous Liver bird atop the Royal Liverpool Insurance building near the newly tarted up docks.

Evan playing in the sand on a calais beach in France.

This is the old water tower at the Goodyear plant in Cartersville, Georgia.

My house...yeah right. This was the beach house we stayed in with my family for vacation, in the Outer Banks.

The water tower in downtown Campbell.

Old cone water tower with Graffiti in Detroit Michigan

Matching using Global Image Features (GIST + Color)

Smallest house in paris between red (on right) and beige (on left).

Bridge to temple in Hoan Kiem lake.

The water is clear enough to see fish swimming around in it.

A walk around the lake near our house with Abby.

Hangzhou bridge in West lake.

The daintree river by boat.

. . .

SBU Captioned Photo Dataset

Transfer Caption(s), e.g. “The water is clear enough to see fish swimming around in it.”

1 million captioned images!

The bridge over the lake on Suzhou Street.

The Daintree river by boat.

Bridge over Cacapon river.

Iron bridge over the Duck river.

. . .

Transfer Caption(s), e.g. “The bridge over the lake on Suzhou Street.”
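The retrieve-then-transfer idea can be illustrated with a tiny nearest-neighbor search. This is a sketch under assumptions: the 3-D vectors stand in for real global descriptors such as GIST + color, and `transfer_captions` is a hypothetical helper name.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def transfer_captions(query_feature, database, k=1):
    """Return the captions of the k database images closest to the query.

    `database` is a list of (feature_vector, caption) pairs.
    """
    ranked = sorted(database, key=lambda item: euclidean(query_feature, item[0]))
    return [caption for _, caption in ranked[:k]]

# Toy 3-D "descriptors"; real global descriptors have far more dimensions.
database = [
    ([0.9, 0.1, 0.2], "The bridge over the lake on Suzhou Street."),
    ([0.1, 0.8, 0.3], "The cat in the window."),
    ([0.2, 0.2, 0.9], "Kentucky cows in a field."),
]
print(transfer_captions([0.85, 0.15, 0.25], database, k=1))
```

With only a handful of database images the nearest neighbor is often a poor match, which is why the collection size matters so much for this approach.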

Rerank retrieved images using high level content (captions, object detections, scene classification, stuff detections, people & actions)

sky, trees, water, building, bridge

One of the many stone bridges in town that carry the gravel carriage roads.

An old bridge over dirty green water.

A stone bridge over a peaceful river.

Computer Vision

Our Goal

The view from the 13th floor of an apartment building in Nakano awesome.

Please choose the image that better corresponds to the given caption:

In addition, we propose a new evaluation task where a user is presented with two photographs and one caption. The user must assign the caption to the most relevant image. For evaluation we use a query image, a random image and a generated caption.

Page 29

Power-laws

Family of distributions of the form:

$f(x) = a x^{k}$

• Frequency of tag words
• Content popularity

A limited vocabulary appears an extremely large number of times

Most of the words are rare

frequency

Long tail

figure: wikipedia
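One way to check whether tag frequencies follow the power-law form f(x) = a x^k is a straight-line fit in log-log space, since log f = log a + k log x. The sketch below uses synthetic Zipf-like counts (an assumption, not data from the talk); a long tail corresponds to a negative exponent k.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of log y = log a + k log x; returns (a, k)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    k = sxy / sxx
    a = math.exp(mean_y - k * mean_x)
    return a, k

ranks = list(range(1, 101))
freqs = [1000.0 / r for r in ranks]  # synthetic counts: frequency = 1000 / rank
a, k = fit_power_law(ranks, freqs)
print(a, k)
```

On real tag counts the fit is usually noisy in the tail, so the estimated exponent should be read as a rough summary rather than a precise parameter.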

Page 30

Collecting data in the field

• Full control over data
  – Sensor types
    • RGB-D, Panorama
  – Quality
• No copyright issue
• Cost and scalability

Page 31

Data collection summary

• Web, fieldwork, building on existing dataset

• Legal concerns

• Quality issues

• Probably bigger is better

– Deep learning requires big data

Page 32

ANNOTATING DATA

Page 33

Annotation process

Input: Raw data $D' = \{(x, z)\}$

Weak labels
• Search-keywords
• Tags
• GPS (?)

Output: Annotated data $D = \{(x, y)\}$

Clean labels
• Image-labels
• Bounding-boxes
• Pixel-labels

Annotation system

Image

Page 34

Designing annotation tools

Types of annotations
• Category tag
• Bounding box
  – Human pose
• Segmentation
  – Polygon
  – Super-pixels
• Natural language
• Attributes
• Tracking

Tools
• HTML / JavaScript
• Web server to host images

Page 35

Bounding boxes

Page 36

Segmentation

• Pixel-wise labels
• Approximation
  – polygons
  – super-pixels

LabelMe [Russell 2007]

Page 37

Natural language

• Image-text pairs

• Decomposable with NLP techniques
  – Attribute: Adj + Noun
  – Action: Noun + Verb

• Sentence generation

• One jet lands at an airport while another takes off next to it.

• Two airplanes parked in an airport.

• Two jets taxi past each other.

• Two parked jet airplanes facing opposite directions.

• two passenger planes on a grassy plain

UIUC Pascal Sentence [Rashtchian 2010]

Page 38

Relative attributes [Parikh 2011]

> natural

< smiling

Slide credit: Devi Parikh

Is this natural?

Page 39

Object annotation in videos

Vatic [Vondrick 2012]

Page 40

Choosing the right task

• The more difficult the task is, the more expensive annotation becomes
  – (worker) time is money
  – Very difficult task results in poor quality

• Decompose a very complicated task into multiple simple tasks
  – e.g., Single task to detect ALL objects -> Multiple tasks to detect specific objects

Page 41

Scaling up multi-label annotation

• Goal: Efficiently labeling hundreds of categories

Table  Chair  Horse  Dog  Cat  Bird  ...
+      +      -      -    -    -
+      -      -      -    +    -
+      +      -      -    -    -

[Deng 2014]

~1000 ?

Page 42

Hierarchy, sparsity, correlation [Deng 2014]

Is there a table?

Is there a chair?

Is there a horse?

Is there a dog?

Is there a cat?

Is there a bird?

Naively asking 1000 labels

Is there an animal?

Is there a mammal?

Is there a cat?

Hierarchical questions

No table, chair

Table Chair Horse Dog Cat Bird

No bird

Probably no horse?
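The saving from hierarchical questions can be illustrated with a toy simulation. This is a sketch, not the algorithm of [Deng 2014]: the tiny hierarchy (including the `furniture` node) is an assumption, and the set `present` stands in for the human annotator's answers.

```python
# Hypothetical two-branch hierarchy over the six categories on the slide.
HIERARCHY = {
    "animal": {"mammal": {"horse": {}, "dog": {}, "cat": {}}, "bird": {}},
    "furniture": {"table": {}, "chair": {}},
}

def label_image(present, tree=HIERARCHY):
    """Ask yes/no questions top-down and skip a subtree after a 'no'.

    `present` is the set of leaf categories truly in the image.
    Returns (positive_leaves, number_of_questions_asked).
    """
    def leaves(subtree, name):
        if not subtree:
            return {name}
        out = set()
        for child, grandchildren in subtree.items():
            out |= leaves(grandchildren, child)
        return out

    found, asked = set(), 0
    stack = list(tree.items())
    while stack:
        name, subtree = stack.pop()
        asked += 1  # one question: "Is there a <name>?"
        if leaves(subtree, name) & present:
            if subtree:
                stack.extend(subtree.items())
            else:
                found.add(name)
    return found, asked

print(label_image({"table"}))  # 4 questions instead of 6
print(label_image(set()))      # 2 questions instead of 6
```

The benefit relies on sparsity: most categories are absent from most images. In this toy hierarchy an image containing a cat needs 7 questions, one more than asking all 6 labels naively, so the saving only becomes substantial with hundreds of categories.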

Page 43

Human-in-the-loop approach

• Active learning
  – Only annotating uncertain instances
  – Lower costs
  – Faster learning

Unannotated images

Annotated images

Selection Model

Selected images

Annotators

[Vijayanarasimhan 2011]
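The selection step of such a loop can be sketched with plain uncertainty sampling. This is a generic illustration, not the selection criterion of [Vijayanarasimhan 2011]; the scores and image names are made up.

```python
def select_uncertain(probabilities, budget):
    """Pick the `budget` unlabeled items whose predicted positive-class
    probability is closest to 0.5 (binary uncertainty sampling)."""
    ranked = sorted(probabilities, key=lambda item: abs(probabilities[item] - 0.5))
    return ranked[:budget]

# Hypothetical model outputs for five unannotated images.
model_scores = {
    "img_a": 0.97,
    "img_b": 0.52,
    "img_c": 0.08,
    "img_d": 0.41,
    "img_e": 0.75,
}
to_annotate = select_uncertain(model_scores, budget=2)
print(to_annotate)
```

Images the model is already confident about (`img_a`, `img_c`) are not sent to annotators, which is where the cost saving comes from.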

Page 44

Pitfall: Asking people for validation

Is this an airplane?

• Answer yes if a green rectangle is drawn around an airplane. Otherwise answer no.

Rule-of-thumb: Ask to annotate ground-truth

yes no

Machines are very unlikely to produce 100% correct detection

Page 45

Crowdsourcing market

Online workers

$$$

Requester

Tasks

• Image classification
• Object detection
• Segmentation
• Language description

Result

Monetary rewards

Amazon Mechanical Turk, CrowdFlower, etc.

As of 2015, requesters outside the US probably need somebody in the US or an agent to use MTurk.

Page 46

Amazon MTurk demographics

• Country: 80% US, 20% India

• Gender: 50% Male, 50% Female

• Age:
  – 50% 30 years old
  – 20% 20 years old
  – 20% 40 years old

P Ipeirotis, Demographics of Mechanical Turk: Now Live! (April 2015 edition) http://www.behind-the-enemy-lines.com

http://blogs.scientificamerican.com/guilty-planet/

Page 47

Workers are not the same

P Welinder et al., The Multidimensional Wisdom of Crowds, NIPS 2010

• One annotator = one classifier for “duck” presence
• Estimated decision parameters from Bayesian model
• Groups 1, 2, 3 have different decision boundaries

Page 48

Quality control

• There are always sloppy annotators
  – Think of a bot randomly clicking on buttons
• Have a qualification test
• Insert JavaScript to validate answers
  – Reject too short or fast answers
• Assign multiple annotators per task
• Control worker motivation
  – Feedback, Gamification
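Two of these controls, rejecting too-fast answers and assigning multiple annotators per task, can be combined in a small post-processing step. The sketch below is an assumption-laden illustration: the slide suggests validating in JavaScript on the client, whereas this runs afterwards in Python, and the 2-second threshold and record layout are placeholders.

```python
from collections import Counter

def aggregate_labels(responses, min_seconds=2.0):
    """Majority vote per task after dropping suspiciously fast answers.

    `responses` is a list of (task_id, worker_id, label, seconds_spent).
    """
    votes = {}
    for task_id, _worker, label, seconds in responses:
        if seconds < min_seconds:
            continue  # likely a random click
        votes.setdefault(task_id, Counter())[label] += 1
    # most_common(1) breaks ties by first-seen order.
    return {task: counts.most_common(1)[0][0] for task, counts in votes.items()}

responses = [
    ("img1", "w1", "duck", 5.1),
    ("img1", "w2", "duck", 4.0),
    ("img1", "w3", "no duck", 0.4),  # too fast, ignored
    ("img2", "w1", "no duck", 6.3),
    ("img2", "w2", "duck", 3.2),
    ("img2", "w3", "no duck", 7.9),
]
consensus = aggregate_labels(responses)
print(consensus)
```

Majority voting treats all workers as equally reliable; a per-worker model is the natural next step when annotators differ systematically.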

Page 49

Qualification tests

• MTurk can be set up to allow only workers passing a qualification test
  – Prepare test questions and gold-standard answers
  – Useful to assess, e.g., writing ability

• Also possible to validate workers during the main tasks

Annotation tasks

Qualification test

Good Bad
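Scoring a worker against gold-standard answers can be as simple as the sketch below. It does not use the MTurk API; the question ids, answers, and the 80% pass threshold are all hypothetical.

```python
def passes_qualification(answers, gold, min_accuracy=0.8):
    """Return True if a worker's answers match enough gold-standard answers."""
    correct = sum(1 for q, expected in gold.items() if answers.get(q) == expected)
    return correct / len(gold) >= min_accuracy

gold = {"q1": "yes", "q2": "no", "q3": "yes", "q4": "no", "q5": "yes"}
good_worker = {"q1": "yes", "q2": "no", "q3": "yes", "q4": "no", "q5": "no"}
bad_worker = {"q1": "yes", "q2": "yes", "q3": "no", "q4": "no", "q5": "no"}

print(passes_qualification(good_worker, gold))
print(passes_qualification(bad_worker, gold))
```

The same function can be reused during the main tasks by mixing a few gold-standard items into each batch.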

Page 50

Giving feedback to workers

• Feedback motivates workers
  – Also include a comment form to get their opinion

100% 0% 50%

You’re the rookie! You’ve annotated 20 images You have only 20 images left!

Page 51

Game-based annotation

• ESP Game

• Pros
  – No MTurk
  – Motivation

• Cons
  – Cheating
  – Bias

many.corante.com

[Ahn 2004]

Page 52

ReferItGame [Kazemzadeh 2014]

guy in front

guy in front

Player 1

Player 2 man in red shirt

man in red shirt

Write a referring expression

Click on the referred object

Page 53

Finding experts on the Web

• Quizz
  – Free medical quiz using targeted ads
  – Knowledgeable volunteers, without monetary rewards
  – Much faster with better quality

• Do monetary rewards harm quality?

[Ipeirotis 2014]

Page 54

Crowdsourcing considerations

• Know your workers

• Quality control
  – Fun tasks attract motivated workers!

Page 55

Part I: dataset construction

• What is your dataset for?
  – Know your task
• Collecting data
  – On the Web or in the field, quality and big data
• Annotating data
  – Designing the right tool
• Crowdsourcing
  – Workers and quality control

Page 56

Part II: Case studies

• Data-driven clothing parsing
• Popularity analysis
• Studying fashion styles
• Studying fashion trends

Page 57

DATA-DRIVEN CLOTHING PARSING

CVPR 2012, ICCV 2013

Page 58

style.com

Clothing parsing

Page 59

Pose and clothing

Semantic segmentation: [Shotton 06] [Gould 09] [Liu 09] [Eigen 12] [Singh 13] [Tighe 10, 13, 14] [Dong 13]

Pose estimation: [Ferrari 08] [Bourdev 09] [Yang 11] [Dantone 13] [Ladicky 13]

Page 60

Online fashion networks

Chictopia Lookbook Chicisimo Pinterest Tumblr ...

www.chictopia.com

Page 61

Datasets

• Fashionista dataset
  – Small, completely annotated images
  – For supervised learning

• Paper Doll dataset
  – Large-scale tagged images
  – For semi-supervised approach

Page 62

Fashionista dataset

• 685 images
  – pose annotation
  – super-pixel labels

• Manually picked images from Chictopia

• Crowd-sourced annotation

(a) Superpixels (b) Pose estimation (c) Predicted Clothing Parse (d) Pose re-estimation

Legend: null, shorts, shoes, purse, top, necklace, hair, skin

Figure 2: Clothing parsing pipeline: (a) Parsing the image into Superpixels [1], (b) Original pose estimation using state of the art flexible mixtures of parts model [27]. (c) Precise clothing parse output by our proposed clothing estimation model (note the accurate labeling of items as small as the wearer’s necklace, or as intricate as her open toed shoes). (d) Optional re-estimate of pose using clothing estimates (note the improvement in her left arm prediction, compared to the original incorrect estimate down along the side of her body).

garment retrieval application (Fig 1).

Our main contributions include:

• A novel dataset for studying clothing parsing, consisting of 158,235 fashion photos with associated text annotations, and web-based tools for labeling.

• An effective model to recognize and precisely parse pictures of people into their constituent garments.

• Initial experiments on how clothing prediction might improve state of the art models for pose estimation.

• A prototype visual garment retrieval application that can retrieve matches independent of pose.

Of course, clothing estimation is a very challenging problem. The number of garment types you might observe in a day on the catwalk of a New York city street is enormous. Add variations in pose, garment appearance, layering, and occlusion into the picture, and accurate clothing parsing becomes formidable. Therefore, we consider a somewhat restricted domain, fashion photos from Chictopia.com. These highly motivated users – fashionistas – upload individual snapshots (often full body) of their outfits to the website and usually provide some information related to the garments, style, or occasion for the outfit. This allows us to consider the clothing labeling problem in two scenarios: 1) a constrained labeling problem where we take the users’ noisy and perhaps incomplete tags as the list of possible garment labels for parsing, and 2) where we consider all garment types in our collection as candidate labels.

1.1. Related Work

Clothing recognition: Though clothing items determine most of the surface appearance of the everyday human, there have been relatively few attempts at computational recognition of clothing. Early clothing parsing attempts focused on identifying layers of upper body clothes in very limited situations [2]. Later work focused on grammatical representations of clothing using artists’ sketches [6]. Freifeld and Black [13] represented clothing as a deformation from an underlying body contour, learned from training examples using principal component analysis to produce eigen-clothing. Most recently attempts have been made to consider clothing items such as t-shirt or jeans as semantic attributes of a person, but only for a limited number of garments [4]. Different from these past approaches, we consider the problem of estimating a complete and precise region based labeling of a person’s outfit, for general images with a large number of potential garment types.

Clothing items have also been used as implicit cues of identity in surveillance scenarios [26], to find people in an image collection of an event [11, 22, 25], to estimate occupation [23], or for robot manipulation [16]. Our proposed approach could be useful in all of these scenarios.

Pose Estimation: Pose estimation is a popular and well studied enterprise. Some previous approaches have considered pose estimation as a labeling problem, assigning most likely body parts to superpixels [18], or triangulated regions [20]. Current approaches often model the body as a collection of small parts and model relationships among them, using conditional random fields [19, 9, 15, 10], or discriminative models [8]. Recent work has extended patches to more general poselet representations [5, 3], or incorporated mixtures of parts [27] to obtain state of the art results. Our pose estimation subgoal builds on this last method [27], extending the approach to incorporate clothing estimations in models for pose identification.

Image Parsing: Image parsing has been studied as a step toward general image understanding [21, 12, 24]. We consider a similar problem (parsing) and take a related ap-

Page 63

Annotation tools

Web-based annotation tools at Amazon Mechanical Turk

Lesson: Segmentation is too hard for MTurk workers

Page 65

CRF-based parsing

Legend labels of the four example parses:
• null, shoes, shirt, jeans, hair, skin
• null, tights, jacket, dress, hat, heels, hair, skin
• null, shorts, blouse, bracelet, wedges, hair, skin
• null, shoes, top, stockings, hair, skin

Figure 4: Successful results on the Fashionista dataset.

(a) Skin-like color: null, purse, boots, sweater, hat, bracelet, hair, skin
(b) Failing pose estimate: null, t-shirt, shoes, jacket, hair, skin
(c) Spill in the background: null, tights, boots, jacket, dress, hat, hair, skin
(d) Coarse pattern: null, purse, dress, accessories, belt, heels, hair, skin

Figure 5: Failure cases

the art [27]). As motivation for future research on clothing estimation, we also find that given true clothing labels our pose re-estimation system reaches a PCP of 89.5%, demonstrating the potential usefulness of incorporating clothing into pose identification.

4.4. Retrieving Visually Similar Garments

We build a prototype system to retrieve garment items via visual similarity in the Fashionista dataset. For each parsed garment item, we compute normalized histograms of RGB and L*a*b* color within the predicted labeled re-

CVPR 12

Page 66

Failure cases CVPR 12

Legend labels of the four example parses:
• null, shoes, shirt, jeans, hair, skin
• null, tights, jacket, dress, hat, heels, hair, skin
• null, shorts, blouse, bracelet, wedges, hair, skin
• null, shoes, top, stockings, hair, skin

Figure 5: Example successful results on the Fashionista dataset.

(a) Skin-like color: null, purse, boots, sweater, hat, bracelet, hair, skin
(b) Failing pose estimate: null, t-shirt, shoes, jacket, hair, skin
(c) Spill in the background: null, tights, boots, jacket, dress, hat, hair, skin
(d) Coarse pattern: null, purse, dress, accessories, belt, heels, hair, skin

Figure 6: Example failure cases

4.4. Retrieving Visually Similar Garments

We build a prototype system to retrieve garment items via visual similarity in the Fashionista dataset. For each parsed garment item, we compute normalized histograms of RGB and L*a*b* color within the predicted labeled region, and measure similarity between items by Euclidean distance. For retrieval, we prepare a query image and obtain a list of images ordered by visual similarity. Figure 1 shows a few of top retrieved results for images displaying shorts, blazer, and t-shirt (query in leftmost col, retrieval results
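The histogram-plus-Euclidean-distance retrieval described in this excerpt can be sketched as follows. This is a simplified illustration, not the paper's code: it uses a joint RGB histogram only (the L*a*b* histogram is omitted), the bin count is arbitrary, and the pixel lists stand in for the pixels inside each predicted garment region.

```python
import math

def color_histogram(pixels, bins=4):
    """Normalized joint RGB histogram of a list of (r, g, b) pixels in 0..255."""
    hist = [0.0] * (bins ** 3)
    step = 256 // bins
    for r, g, b in pixels:
        idx = (r // step) * bins * bins + (g // step) * bins + (b // step)
        hist[idx] += 1.0
    total = sum(hist)
    return [h / total for h in hist]

def rank_by_similarity(query_pixels, items):
    """Order item names by Euclidean distance between color histograms."""
    q = color_histogram(query_pixels)

    def dist(name):
        h = color_histogram(items[name])
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, h)))

    return sorted(items, key=dist)

# Hypothetical garment regions, each given as a flat list of pixels.
items = {
    "red_blazer": [(200, 30, 40)] * 10,
    "blue_shorts": [(20, 40, 210)] * 10,
    "dark_red_tshirt": [(140, 20, 30)] * 6 + [(200, 30, 40)] * 4,
}
order = rank_by_similarity([(210, 25, 35)] * 10, items)
print(order)
```

Because the histogram is computed only inside the parsed region, the match depends on the garment's color and not on the wearer's pose or the background.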

Page 67

Paper Doll parsing

Retrieval-based approach

1. Get tagged images on the Web
2. Retrieve similar images
3. Use them to predict items

Diagram labels: Web, Tagged images, Paper Doll Dataset, NN images, Similar images, Candidate tags, Image Parser
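Steps 2 and 3 can be sketched as a nearest-neighbor vote over tags. This is a toy illustration and not the Paper Doll implementation: the 2-D features, the vote threshold, and the function name `predict_tags` are assumptions, while the tag lists are borrowed from the retrieval example in these slides.

```python
import math
from collections import Counter

def predict_tags(query_feature, tagged_images, k=3, min_votes=2):
    """Predict candidate tags from the k nearest tagged images.

    `tagged_images` is a list of (feature_vector, tags) pairs. A tag is
    kept if at least `min_votes` of the neighbors carry it.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    neighbors = sorted(tagged_images, key=lambda item: dist(query_feature, item[0]))[:k]
    votes = Counter(tag for _, tags in neighbors for tag in set(tags))
    return sorted(tag for tag, n in votes.items() if n >= min_votes)

tagged_images = [
    ([0.10, 0.90], ["belt", "shirt", "shoes", "skirt", "tights"]),
    ([0.20, 0.80], ["skirt", "top"]),
    ([0.15, 0.85], ["boots", "skirt"]),
    ([0.90, 0.10], ["dress", "hat", "heels"]),
]
candidate_tags = predict_tags([0.12, 0.88], tagged_images)
print(candidate_tags)
```

With more tagged images, the neighbors become more similar to the query, so the voted tags become more reliable.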

Page 68

Tagged images on the Web

Dress Hat

Heels Sweater Heels

Page 69

Paper Doll dataset ~339,000 images

Page 70

Retrieval example

bag cardigan heels shorts top

boots skirt

flats necklace shirt skirt

belt pumps skirt t-shirt

belt shirt shoes skirt tights

skirt top

Query

dress shoes skirt tights belt top

Candidate tags ...

...

Page 71

Mixture of retrieval-based methods

Global parsing

NN parsing

Transferred parsing

Combined parsing

Combine predictions

Final parsing

Smoothing

Input image

Similar styles

Candidate items

Page 72

Results Input Truth Paper Doll CRF

Page 73

Results Input Truth CRF Paper Doll

Page 74

Big data benefits: performance

Data size

*CRF baseline doesn’t use big data

Page 75

Big data benefits: qualitatively

Input

skin

hair

bag

boots

dress

skirt

top

Data size = 256

Data size = 262,144

accessories bag boots dress necklace shoes shorts skirt top

bag boots dress heels skirt sunglasses top

Parsing

Page 76

Data-driven clothing parsing

• Fashionista dataset
  – small
  – completely-annotated

• Paper Doll dataset
  – large
  – user-annotated

Page 77

POPULARITY ANALYSIS MM2014

Page 78

Predicted most popular

Predicted least popular

Popularity prediction

Page 79

Regression analysis in 300K posts

Input

Content factors
• Tag TF-IDF
• Image composition
• Color entropy
• Style descriptor
• Parse descriptor

Social factors
• User identity
• Previous posts
• Node degrees

Output

Popularity
• Votes

Page 80

Like button in Chictopia

Long tail

Promotion effect?

Page 81

Findings

• The outfit doesn’t matter (!!!)

• Popularity is mostly the outcome of the social network
  – social bias
  – #votes ∝ #followers
  – People just click on friends’ photos

Page 82

Regression performance

Factors            R²      Spearman   Accuracy (top 25%)   Accuracy (top 75%)
Social             0.491   0.682      0.847                0.779
Content            0.248   0.488      0.778                0.737
Social + Content   0.493   0.685      0.845                0.775

Social factors significantly boost the performance
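For readers unfamiliar with the first two columns, the sketch below computes R² and Spearman rank correlation for a set of predictions. The numbers are made up for illustration and are unrelated to the Chictopia experiment; the Spearman formula used here assumes there are no tied values.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def spearman(a, b):
    """Spearman rank correlation, assuming no tied values."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, idx in enumerate(order):
            r[idx] = rank
        return r

    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

votes = [1.0, 2.0, 3.0, 4.0, 5.0]      # hypothetical true popularity
predicted = [1.5, 1.8, 3.2, 4.4, 4.6]  # hypothetical regression output
print(r_squared(votes, predicted), spearman(votes, predicted))
```

R² measures how close the predicted values are, while Spearman only checks whether the ordering is preserved, so a model can rank posts well and still have a modest R².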

Page 83

Rich-get-richer phenomena

• Popularity growth of a linked content is proportional to the current popularity

Easley and Kleinberg 2010
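A small simulation shows how this proportional-growth rule produces a long tail. It is a toy model under assumptions (the item count, vote count, and the +1 smoothing are arbitrary), not the model fitted in the talk.

```python
import random

def simulate_rich_get_richer(n_items=50, n_votes=5000, seed=0):
    """Each new vote goes to an item with probability proportional to
    (current votes + 1); the +1 lets unvoted items be picked at all."""
    rng = random.Random(seed)
    votes = [0] * n_items
    for _ in range(n_votes):
        weights = [v + 1 for v in votes]
        winner = rng.choices(range(n_items), weights=weights, k=1)[0]
        votes[winner] += 1
    return sorted(votes, reverse=True)

counts = simulate_rich_get_richer()
print(counts[:5], counts[-5:])
```

Even though all items start out identical, early random advantages are amplified, so a few items end up with far more votes than the rest.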

Page 84

What if there is no social network?

• Popularity = f ( content factors )?

Page 85

Crowdsourcing!

• Collecting popularity votes in Amazon MTurk

• No network!

3000 pictures 25 assignments

Page 86

Out-of-network popularity

#images

#votes

No social factor in the voting process

Page 87

Task

• Predict crowd popularity using Content factors and/or Social factors in Chictopia

Social factors

Chictopia

Content factors

MTurk

Voting data

?

Page 88

Predicting crowd votes

Factors            R²      Spearman   Accuracy (top 25%)   Accuracy (top 75%)
Social             0.423   0.634      0.845                0.787
Content            0.428   0.647      0.888                0.862
Social + Content   0.473   0.686      0.884                0.858

• Content factors matter
• Social factors from Chictopia predict crowd votes well
• User-content correlation: Top-bloggers consistently post good pictures

Page 89

Lessons

• Crowdsourcing is not only for getting ground-truth, but also for studying human behavior

• Research opportunity for social visual media

Page 90

STUDYING FASHION STYLES ECCV2014

Page 91

Q: What makes the boy on the right look Harajuku-style?

Tie? Shoes?

tokyofashion.com

Page 92

Goal

• Finding what constitutes a fashion style

• Approach
  – Game-based annotation
  – Attribute factorization

Goth

Page 93:

Who’s more Bohemian?

Page 94:

Other people think...

hipsterwars.com

Page 95:

hipsterwars.com

Game-based relative “style-ness” collection

Asking our online friends for participation

NO MONETARY REWARDS!

Initial keyword search on Google or fashion SNS

Page 96:

Participation statistics

Most users played the game for only a few clicks

Some motivated users clicked A LOT

Page 97:

TrueSkill game algorithm

• Algorithm to select which pair to play

• Idea:
  – Represent each image by a Gaussian over its rating
  – Update the Gaussian parameters after each click
  – Choose images that are expected to tie for the next play

[R. Herbrich, 2007]
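The idea above can be sketched as follows for the two-image, no-draw case. This is a simplified illustration, not the hipsterwars.com implementation: it uses the standard two-player TrueSkill update without a draw margin or the dynamics term, and `BETA`, `update`, `tie_likelihood`, and `pick_pair` are names and defaults assumed here.

```python
import math

BETA = 25.0 / 6.0  # performance noise; a commonly used default, assumed here

def _pdf(t):
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def _cdf(t):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def update(winner, loser, beta=BETA):
    """Return new (mu, sigma) for winner and loser after one click."""
    mu_w, sigma_w = winner
    mu_l, sigma_l = loser
    c = math.sqrt(2.0 * beta ** 2 + sigma_w ** 2 + sigma_l ** 2)
    t = (mu_w - mu_l) / c
    v = _pdf(t) / _cdf(t)   # mean shift factor; large for surprising wins
    w = v * (v + t)         # variance shrink factor, between 0 and 1
    new_w = (mu_w + sigma_w ** 2 / c * v,
             sigma_w * math.sqrt(1.0 - sigma_w ** 2 / c ** 2 * w))
    new_l = (mu_l - sigma_l ** 2 / c * v,
             sigma_l * math.sqrt(1.0 - sigma_l ** 2 / c ** 2 * w))
    return new_w, new_l

def tie_likelihood(a, b, beta=BETA):
    """Match quality: relative chance that images a and b would tie."""
    (mu_a, sigma_a), (mu_b, sigma_b) = a, b
    c2 = 2.0 * beta ** 2 + sigma_a ** 2 + sigma_b ** 2
    return math.sqrt(2.0 * beta ** 2 / c2) * math.exp(-((mu_a - mu_b) ** 2) / (2.0 * c2))

def pick_pair(ratings, beta=BETA):
    """Pick the pair of image ids whose comparison is most likely to tie."""
    ids = list(ratings)
    pairs = [(ids[i], ids[j]) for i in range(len(ids)) for j in range(i + 1, len(ids))]
    return max(pairs, key=lambda p: tie_likelihood(ratings[p[0]], ratings[p[1]], beta))
```

Showing near-tied pairs makes each click more informative: a click between two images with very different ratings mostly confirms what the ratings already say. The exhaustive pair search in `pick_pair` is only practical for small image sets.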

Page 98:

Score distribution after game

Most Hipster

Least Hipster

Page 99:

Annotation examples

[Figure: excerpt from page 10 of ECCV-14 submission ID 1534. Rows: Pinup, Goth, Hipster, Bohemian, Preppy. Columns: Most (Predicted), Least (Predicted).]

Fig. 5: Example results of within-classification task with δ = 0.5. Top and bottom predictions for each style category are shown.

5.2 Within-class classification

Our next style recognition task considers classification between top rated and bottom rated examples for each style independently. Here the goal is, for example, to determine whether a person is an uber-hipster or only sort of hipster. Again, we utilize linear SVMs [27], but here learn one visual model for each style in our dataset. Here δ determines the percentage of top and bottom ranked images used in the classification task. For example, δ = 0.1 means that we use the top rated 10% of images from a style as positive samples and the bottom rated 10% of samples from the same style as negative samples (using the ratings computed in Sec 3.2). We evaluate experiments for δ ranging from 10% to 50%. We repeat the experiments for 100 random folds with a 9:1 train to test ratio. In each experiment, C is determined using 5 fold cross-validation.

Results are reported in Figure 6. We observe that when δ is small we generally have better performance than for larger δ. This is because the classification task generally becomes more challenging as we add less extreme examples of each style. Additionally, we find best performance on the pinup category. Performance on the goth category comes in second. For the hipster category, we do quite well at differentiating between extremely strong or weak examples, but performance

MOST LEAST

High-quality dataset without Amazon MTurk
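The within-class setup quoted in the paper excerpt on this slide picks the top and bottom δ fraction of images by rating as positives and negatives. A minimal sketch of that selection step is below; the function name `split_extremes` and the dictionary input are assumptions made for illustration, and the rounding rule for the sample count is a guess, since the excerpt does not specify it.

```python
def split_extremes(ratings, delta):
    """Select the top and bottom `delta` fraction of images by rating.

    ratings: dict mapping image id -> style rating (higher = more of the style).
    Returns (positives, negatives) as lists of image ids.
    """
    if not 0.0 < delta <= 0.5:
        raise ValueError("delta must be in (0, 0.5] so the two sets do not overlap")
    ranked = sorted(ratings, key=ratings.get, reverse=True)
    # At least one sample per side; rounding to the nearest count is an assumption.
    k = max(1, int(round(len(ranked) * delta)))
    return ranked[:k], ranked[-k:]
```

For example, with ten images rated 0 through 9 and `delta=0.2`, the positives are the two highest-rated images and the negatives the two lowest. The two lists would then be split 9:1 into train and test sets and passed to a linear SVM, as the excerpt describes.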

Page 100:

Relative vs. absolute

• Asked MTurk workers for absolute ratings on a 1-10 scale

• Much noisier results from MTurk

Page 101:

Analyzing what makes her look preppy

Factorization results

Page 102:

Fashion style analysis

• Game-based annotation collected high-quality data without monetary rewards

• How can we collect seed images?

Page 103:

STUDYING FASHION TRENDS WACV2015

Page 104:

Fashion trend: Runway to realway

Fashion show Street

Page 105:

Runway dataset

~35k images in 9k fashion shows over 15 years, from 2000 to 2014

Page 106:

Brands by photos

[Histogram: x-axis Photos (ticks at 10^1, 10^2, 10^3); y-axis Brands (0 to 12)]

Page 107:

The query image is given in the left column, while five candidate images are shown in the right columns.

1. Select an image with the most similar outfit to the query.
2. If there is NO similar image, please select NONE.

Query image

NONE

Collecting human judgments to learn similarity

Select an image with the most similar outfit to the query image

Page 108:

Runway-to-runway retrieval

Retrieving similar styles from other fashion shows

Page 109:

Runway-to-realway retrieval

Retrieving similar styles from street snaps

Page 110:

Visually analyzing floral trend

Runway image of floral

Retrieved images in street with timestamp

Peaks in spring!

% retrieved images
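One way to produce a seasonal curve like the one on this slide is to bin the timestamps of the retrieved street images by month and normalize. The sketch below assumes the percentage is taken over all retrieved images; the slide does not state the normalization, and `monthly_share` is a hypothetical helper.

```python
from collections import Counter
from datetime import date

def monthly_share(timestamps):
    """Return {month: percent of retrieved images posted in that month}.

    timestamps: iterable of datetime.date (or datetime) objects, one per
    retrieved street image. Months with no images map to 0.0.
    """
    counts = Counter(ts.month for ts in timestamps)
    total = sum(counts.values())
    if total == 0:
        return {m: 0.0 for m in range(1, 13)}
    return {m: 100.0 * counts.get(m, 0) / total for m in range(1, 13)}
```

A spring peak would show up as larger values for the spring months than for the rest of the year.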

Page 111:

Part II: Case studies

• Vital roles of data
  – Data-driven clothing parsing
    • Small complete annotations, large-scale tags
  – Popularity analysis
    • Verifying network phenomena using crowds
  – Studying fashion styles
    • Game-based data collection
  – Studying fashion trends
    • Learning human judgments to analyze trend

Page 112:

Effective dataset construction

[Diagram: Quality versus Scale (ticks at 100, 1K, 10K, 100K, 1M, 10M, 100M, 1B, 10B). Regions: In-house datasets, Crowd-sourced datasets, Raw user-generated data.]

Driving force to computer vision

Wisdom of crowds