
Visual Recognition with Humans in the Loop

Anonymous ECCV submission
Paper ID ***

Abstract. We present an interactive, hybrid human-computer method for object classification. The method applies to classes of problems that are difficult for most people, but are recognizable by people with the appropriate expertise (e.g., animal species or airplane model recognition). The classification method can be seen as a visual version of the 20 questions game, where questions based on simple visual attributes are posed interactively. The goal is to identify the true class while minimizing the number of questions asked, using the visual content of the image. Incorporating user input drives up recognition accuracy to levels that are good enough for practical applications; at the same time, computer vision reduces the amount of human interaction required. The resulting hybrid system is able to handle difficult, large multi-class problems with tightly-related categories.

We introduce a general framework for incorporating almost any off-the-shelf multi-class object recognition algorithm into the visual 20 questions game, and provide methodologies to account for imperfect user responses and unreliable computer vision algorithms. We evaluate the accuracy and computational properties of different computer vision algorithms and the effects of noisy user responses on a dataset of 200 bird species and on the Animals With Attributes dataset. Our results demonstrate the effectiveness and practicality of the hybrid human-computer classification paradigm.

1 Introduction

Multi-class object recognition is a widely studied field in computer vision that has undergone rapid change and progress over the last decade. These advances have largely focused on types of object categories that are easy for humans to recognize, such as motorbikes, chairs, horses, bottles, etc. Finer-grained categories, such as specific types of motorbikes, chairs, or horses, are more difficult for humans and have received comparatively little attention. One could argue that object recognition as a field is simply not mature enough to tackle these types of finer-grained categories. Performance on basic-level categories is still lower than what people would consider acceptable for practical applications (state-of-the-art accuracy on Caltech-256 [1] is ≈ 45%, and the winner of the 2009 VOC detection challenge [2] achieved only ≈ 28% average precision). Moreover, the number of object categories in most object recognition datasets is still fairly low, and increasing the number of categories further is usually detrimental to performance [1].


[Fig. 1 panels: (A) Easy for Humans ("Chair? Airplane? ..."), (B) Hard for Humans ("Finch? Bunting? ..."), (C) Easy for Humans ("Yellow Belly? Blue Belly? ...").]

Fig. 1. Examples of classification problems that are easy or hard for humans. While basic-level category recognition (left) and recognition of low-level visual attributes (right) are easy for humans, most people struggle with finer-grained categories (middle). By defining categories in terms of low-level visual properties, hard classification problems can be turned into a sequence of easy ones.

On the other hand, recognition of finer-grained categories is an important problem to study – it can help people recognize types of objects they don't yet know how to identify. We believe a hybrid human-computer recognition method is a practical intermediate solution toward applying contemporary computer vision algorithms to these types of problems. Rather than trying to solve object recognition entirely, we take on the objective of minimizing the amount of human labor required. As research in object recognition progresses, tasks will become increasingly automated, until eventually we will no longer need humans in the loop. This approach differs from some of the prevailing ways in which people approach research in computer vision, where researchers begin with simpler and less realistic datasets and progressively make them more difficult and realistic as computer vision improves (e.g., Caltech-4 → Caltech-101 → Caltech-256). The advantage of the human-computer paradigm is that we can provide usable services to people in the interim period where computer vision is still unsolved. This may help increase demand for computer vision, spur data collection, and provide solutions for the types of problems people outside the field want solved.

In this work, our goal is to provide a simple framework that makes it as effortless as possible for researchers to plug their existing algorithms into the human-computer framework and use humans to drive up performance to levels that are good enough for real-life applications. Implicit to our model is the assumption that lay-people generally cannot recognize finer-grained categories (e.g., Myrtle Warbler, Thruxton Jackaroo, etc.) due to imperfect memory or limited experiences; however, they do have the fundamental visual capabilities to recognize the parts and attributes that collectively make recognition possible (see Fig. 1). By contrast, computers lack many of the fundamental visual capabilities that humans have, but have perfect memory and are able to pool knowledge collected from large groups of people. Users interact with our system by answering simple yes/no or multiple choice questions about an image or object, as shown in Fig. 2.

[Fig. 2 panels (left to right), under the headers "Computer vision is helpful" / "Computer vision is not helpful": "The bird is a Black-footed Albatross" (no questions needed); "Is the belly white? yes; Are the eyes white? yes; The bird is a Parakeet Auklet"; "Is the beak cone-shaped? yes; Is the upper-tail brown? yes; Is the breast solid colored? no; Is the breast striped? yes; Is the throat white? yes; The bird is a Henslow's Sparrow".]

Fig. 2. Examples of the visual 20 questions game on the 200 class Bird dataset. Human responses (shown in red) to questions posed by the computer (shown in blue) are used to drive up recognition accuracy. In the left image, computer vision algorithms can guess the bird species correctly without any user interaction. In the middle image, computer vision reduces the number of questions to 2. In the right image, computer vision provides little help.

Similar to the 20-questions game¹, we observe that the number of questions needed to classify an object from a database of C classes is usually O(log C) (when user responses are accurate), and can be faster when computer vision is in the loop. Our method of choosing the next question to ask uses an information gain criterion and can deal with noisy (probabilistic) user responses. We show that it is easy to incorporate any computer vision algorithm that can be made to produce a probabilistic output over object classes.

¹ See for example http://20q.net.

Our experiments in this paper focus on bird species categorization, which we take to be a representative example of recognition of tightly-related categories. The bird dataset contains 200 bird species and over 6,000 images. We believe that the same types of methodologies used for birds will apply to other object domains.

The structure of the paper is as follows: In Section 2, we discuss related work. In Section 3, we define the hybrid human-computer problem and basic algorithm, which includes methodologies for modeling noisy user responses and incorporating computer vision into the framework. We describe our datasets and implementation details in Section 4, and present empirical results in Section 5.

2 Related Work

Recognition of tightly related categories is still an open area in computer vision, although there has been success in a few areas such as book covers and movie posters (i.e., rigid, mostly flat objects [3]). The problem is challenging because the number of object categories is larger, with low interclass variance, and variability in pose, lighting, and background causes high intraclass variance. The ability to exploit domain knowledge and cross-category patterns and similarities becomes increasingly important.

There exist a variety of datasets related to recognition of tightly-related categories, including Oxford Flowers 102 [4], Oxford Birds [5], and STONEFLY9 [6]. While these works represent progress, they still have shortcomings in scaling to large numbers of categories, applying to other types of object domains, or achieving performance levels that are good enough for real-world applications. Perhaps most similar in spirit to our work is the Botanist's Field Guide [7], a system for plant species recognition with hundreds of categories and tens of thousands of images. One key difference is that their system is intended primarily for experts, and requires plant leaves to be photographed in a controlled manner at training and test time, making segmentation and pose normalization possible. In contrast, all of our training and testing images are obtained from Flickr in unconstrained settings (see Fig. 3), and the system is intended to be used by lay-people.

There exists a multitude of different areas in computer science that interleave vision, learning, or other processing with human input. Relevance feedback [8,9] is a method for interactive image retrieval, in which users mark the relevancy of image search results, which are in turn used to create a refined search query. Active learning algorithms [10,11,12] interleave training a classifier with asking users to label examples, where the objective is to minimize the total number of labeling tasks. Our objectives are somewhat similar, except that we are querying information at runtime rather than training time. Perhaps the most similar area to our application is expert systems [13,14], which are used for applications such as medical diagnosis, accounting, process control, and software troubleshooting. Expert systems attempt to answer a problem that could normally only be solved by one or more experts, and involve construction of a knowledge base and inference rules that can be used to synthesize a set of steps or input queries dynamically. Our method can be seen as an application of expert systems to object recognition, with the key addition that we are able to use the observed image pixels as an additional source of information. Computationally, our method also has similarities to algorithms based on information gain, entropy calculation, and decision trees [15,16,17,18].

Finally, a lot of progress has been made on trying to scale object recognition to large numbers of categories. Such approaches include using class taxonomies [19,20], feature sharing [21], error correcting output codes (ECOC) [22], and attribute based classification methods [23,24,25]. All of these methods could easily be plugged into our framework to incorporate user interaction.

[Fig. 3 rows list the 25 attribute questions: back color, back pattern, belly color, belly pattern, bill shape, breast color, breast pattern, crown color, eye color, forehead color, head pattern, leg color, nape color, primary color, shape, size, tail pattern, throat color, under tail color, underparts color, upper tail color, upperparts color, wing color, wing pattern, wing shape. Columns show the confidence levels Guessing/Probably/Definitely for five example species: Ivory Gull, Bank Swallow, Indigo Bunting, Whip-poor-will, Chuck-will's-widow.]

Fig. 3. Examples of user responses for each of the 25 attributes. The distribution over {Guessing, Probably, Definitely} is color coded with blue denoting 0% and red denoting 100% of the five answers per image-attribute pair.


3 Visual Recognition With Humans in the Loop

Given an image x, our goal is to determine the true object class c ∈ {1...C} by posing questions based on visual properties that are easy for the user to answer (see Fig. 1). At each step, we aim to exploit the visual content of the image and the current history of question responses to intelligently select the next question. The basic algorithm flow is summarized in Fig. 4.

Let Q = {q_1...q_n} be a set of random variables corresponding to all possible questions (e.g., IsRed?, HasStripes?, etc.), and A be the set of possible answers², such that q_i ∈ A. We can also ask users to select a confidence value for each question; let r_i be a random variable corresponding to the confidence reported for question i, and V be the set of possible confidence scores (e.g., Guessing, Probably, Definitely), such that r_i ∈ V. For convenience we define variables u_i = (q_i, r_i); we will refer to these as "questions" from now on, and it should be clear from context when we mean question/confidence pairs.

Let j ∈ {1...n}^T be an array of T indices to questions we will ask the user. U^{t−1} = {u_{j(1)}...u_{j(t−1)}} is the set of questions asked by time step t − 1. At time step t we would like to find the question j(t) that maximizes the expected information gain. Information gain is widely used in decision trees (e.g. [18]) and can be computed from an estimate of p(c|x, U).

We define I(c; u|x, U), the expected information gain of posing the additional question u, as follows:

I(c; u|x, U) = E_u[ KL( p(c|x, u ∪ U) ‖ p(c|x, U) ) ]    (1)
             = Σ_{u ∈ A×V} p(u|x, U) ( H(c|x, U) − H(c|x, u ∪ U) )    (2)

where H(c|x, U) is the entropy of p(c|x, U):

H(c|x, U) = − Σ_{c=1}^{C} p(c|x, U) log p(c|x, U)    (3)
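The quantities in Eqs. (1)–(3) can be computed directly from the current class posterior and the per-question answer likelihoods. The following is a minimal sketch of that computation (not the implementation used in the experiments); the layout of the likelihood table and the function names are assumptions made for illustration only:

```python
import numpy as np

def entropy(p):
    """Entropy of a discrete distribution; zero entries are ignored."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def expected_information_gain(post, p_u_given_c):
    """I(c; u | x, U) for one candidate question (Eqs. 1-3).

    post        : length-C array, current posterior p(c | x, U)
    p_u_given_c : (K, C) array, p(u = k | c) for the K possible
                  answer/confidence values of this question
    """
    h_before = entropy(post)
    gain = 0.0
    for k in range(p_u_given_c.shape[0]):
        p_u = float(np.dot(p_u_given_c[k], post))    # predictive p(u = k | x, U)
        if p_u <= 0:
            continue
        post_after = p_u_given_c[k] * post / p_u     # Bayes' rule update of p(c | x, U)
        gain += p_u * (h_before - entropy(post_after))
    return gain
```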

The general algorithm for interactive object recognition is shown in Algorithm 1. In the next sections, we describe in greater detail methods for modeling user responses and different methods for incorporating computer vision algorithms, which correspond to different ways to estimate p(c|x, U).

3.1 Incorporating Computer Vision

When no computer vision is involved it is possible to pre-compute a decision tree that defines which question to ask for every possible sequence of question answers. With computer vision in the loop, however, the best questions depend dynamically on the contents of the image.

² We model user answers as binary questions; extensions to multiple choice questions are readily available and may be desirable in a future version of the system.


Algorithm 1 Visual 20 Questions Game
1: U⁰ ← ∅
2: for t = 1 to 20 do
3:   j(t) = argmax_k I(c; u_k | x, U^{t−1})
4:   Ask user question u_{j(t)}, and set U^t ← U^{t−1} ∪ {u_{j(t)}}
5: end for
6: Return class c* = argmax_c p(c | x, U^t)
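A rough Python transcription of Algorithm 1 follows, reusing the expected_information_gain sketch above. The ask_user callback, the likelihood tables, and the fixed budget of 20 questions are illustrative placeholders rather than the actual interface:

```python
import numpy as np

def visual_20_questions(p_c_given_x, likelihoods, ask_user, max_questions=20):
    """Interactive loop of Algorithm 1: refine p(c | x, U) question by question.

    p_c_given_x : length-C array, classifier output p(c | x) (or a prior p(c))
    likelihoods : list of (K_i, C) arrays; likelihoods[i][k, c] = p(u_i = k | c)
    ask_user    : callable mapping a question index to the user's answer index
    """
    post = np.asarray(p_c_given_x, dtype=float)
    post = post / post.sum()
    asked = set()
    for _ in range(max_questions):
        candidates = [i for i in range(len(likelihoods)) if i not in asked]
        if not candidates:
            break
        # Pick the unasked question with maximum expected information gain
        j = max(candidates,
                key=lambda i: expected_information_gain(post, likelihoods[i]))
        answer = ask_user(j)
        asked.add(j)
        # Bayesian update with the observed answer/confidence value
        post = post * likelihoods[j][answer]
        post = post / post.sum()
    return int(np.argmax(post)), post
```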

[Fig. 4 depicts an input image processed by computer vision, followed by Question 1: "Is the belly black?" (A: NO) and Question 2: "Is the bill hooked?" (A: YES).]

Fig. 4. Visualization of the basic algorithm flow. The system poses questions to the user, which along with computer vision, incrementally refine the probability distribution over classes.

In this section, we propose a simple framework for incorporating any multi-class object recognition algorithm that produces a probabilistic output over classes. We can compute the posterior as follows:

p(c|x, U) ∝ p(U|c, x) p(c|x) = p(U|c) p(c|x)    (4)

Here we make the assumption that p(U|c, x) = p(U|c); effectively this assumes that the types of noise or randomness that we see in user responses are class-dependent and not image-dependent. We can still accommodate variation in user responses due to user error, subjectivity, external factors, and intraclass variance; however, we throw away some image-related information (for example, we lose the ability to model a change in the distribution of user responses as a result of a computer-vision-based estimate of object pose).

In terms of computation, we estimate p(c|x) using a classifier trained offline (more details in Section 4.3). Upon receiving an image, we run the classifier once at the beginning of the process, and incrementally update p(c|x, U) by gathering more answers to questions from the user. One could imagine a system where computer vision is invoked several times during the process; as categories are weeded out by answers, the system would use a more tuned classifier to update the estimate of p(c|x). However, our preliminary experiments with such methods did not show an advantage³. Note that when no computer vision is involved, we simply replace p(c|x) with a prior p(c).

³ See supplementary material for more details on this.
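Concretely, the posterior of Eq. (4) under this assumption amounts to multiplying the one-shot classifier output by the accumulated answer likelihoods and renormalizing. A small sketch under the same assumed data layout as above (variable names are illustrative only):

```python
import numpy as np

def class_posterior(p_c_given_x, answers, likelihoods):
    """p(c | x, U) ∝ p(U | c) p(c | x), with answers treated as independent given c.

    p_c_given_x : length-C array from the offline classifier (or a prior p(c))
    answers     : list of (question_index, answer_index) pairs observed so far
    likelihoods : list of (K_i, C) arrays; likelihoods[i][k, c] = p(u_i = k | c)
    """
    post = np.asarray(p_c_given_x, dtype=float).copy()
    for i, k in answers:
        post *= likelihoods[i][k]      # accumulate p(u_i | c) for each response
    return post / post.sum()           # normalize over the C classes
```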

3.2 Modeling User Responses

Recall that for each question we may also ask a corresponding confidence value from the user, which may be necessary when an attribute cannot be determined (for example, when the associated part(s) are not visible). We estimate the distribution p(U|c) as follows:

p(U|c) = Π_{i=1}^{t} p(u_i|c)    (5)

In the above we assume that the questions are answered independently given the category. If a single user is answering all of the questions, then this assumption might not hold (responses are correlated due to per-user subjectivity); however, in other variants of our application we may want to crowdsource other users to answer these questions for the purpose of labeling many images (similar to reCAPTCHA [26]), in which case answers would come from different people. It may also be possible to use a more sophisticated model in which we estimate a full joint distribution for p(U|c); in our preliminary experiments this approach did not work well due to insufficient training data.

To compute p(u_i|c) = p(q_i, r_i|c) = p(q_i|r_i, c) p(r_i|c), we assume that p(r_i|c) is uniform. Next, we compute each p(q_i|r_i, c) as the posterior of a multinomial distribution with Dirichlet prior Dir(α_r p(q_i|r_i) + α_c p(q_i|c)), where α_r and α_c are constants, p(q_i|r_i) is a global attribute prior, and p(q_i|c) is estimated by pooling together certainty labels. Incorporating prior terms is important in order to avoid over-fitting when the training examples for any attribute-class pair are sparse. In practice, we use a larger prior term for Guessing than Definitely, α_guess > α_def, which effectively down-weights the importance of any response with certainty level Guessing.
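One plausible way to realize this smoothed multinomial estimate, assuming hypothetical raw answer counts tallied from the collected responses (a sketch, not the exact estimator used here):

```python
import numpy as np

def estimate_q_given_r_c(counts_rc, counts_r, counts_c, alpha_r, alpha_c):
    """Posterior mean of p(q_i | r_i, c) under a Dirichlet prior
    Dir(alpha_r * p(q_i | r_i) + alpha_c * p(q_i | c)).

    counts_rc : length-K answer counts for this (confidence, class) pair
    counts_r  : length-K answer counts pooled over all classes (global prior)
    counts_c  : length-K answer counts pooled over all confidence levels
    alpha_r, alpha_c : prior strengths (e.g. alpha_guess > alpha_def)
    """
    p_q_given_r = counts_r / counts_r.sum()    # global attribute prior p(q_i | r_i)
    p_q_given_c = counts_c / counts_c.sum()    # class-conditional estimate p(q_i | c)
    prior = alpha_r * p_q_given_r + alpha_c * p_q_given_c
    post = counts_rc + prior                   # Dirichlet-multinomial pseudo-counts
    return post / post.sum()
```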

4 Datasets and Implementation Details

In this section we provide a brief overview of the datasets we used, the methods used to construct visual questions, the computer vision algorithms we tested, and parameter settings.

4.1 Birds-200 Dataset

Birds-200 is a dataset of 6033 images over 200 bird species, such as Myrtle Warblers, Pomarine Jaegers, and Black-footed Albatrosses – classes which cannot usually be identified by non-experts. In many cases, different bird species are nearly visually identical (see Fig. 8).

We assembled a set of 25 visual questions (list shown in Fig. 3), which encompass 288 binary attributes (e.g., the question HasBellyColor can take on 15 different possible colors). The list of attributes was extracted from whatbird.com [27], a bird field guide website.

We collected "deterministic" class-attributes by parsing attributes from whatbird.com. Additionally, we collected data on how non-expert users respond to attribute questions via a Mechanical Turk interface. To minimize the effects of user subjectivity and error, our interface provides prototypical images of each possible attribute response. Screenshots of the question-answering user interface are included in the supplemental material.

Fig. 3 shows a visualization of the types of user response results we get on the Birds-200 dataset. It should be noted that the uncertainty of the user responses strongly correlates with the parts that are visible in an image as well as the overall difficulty of the corresponding bird species.

When evaluating performance, test results are generated by randomly selecting a response returned by an MTurk user for the appropriate test image.

4.2 Animals With Attributes

We also tested performance on the Animals With Attributes (AwA) dataset [23], which contains 50 animal classes and 85 binary attributes. We consider this dataset less relevant than birds (because classes are not tightly related), and therefore do not focus as much on this dataset.

4.3 Implementation Details and Parameter Settings

For both datasets, our computer vision algorithms are based on Andrea Vedaldi's publicly available source code [28], which combines vector-quantized geometric blur and color/gray SIFT features using spatial pyramids, multiple kernel learning, and per-class 1-vs-all SVMs. We added additional features based on full image color histograms and vector-quantized color histograms. For each classifier we used Platt scaling [29] to learn parameters for p(c|x) on a validation set. We used 15 training examples for each Birds-200 class and 30 training examples for each AwA class. Bird training and testing images are roughly cropped.

Additionally, we compare performance to a second computer vision algorithm based on attribute classifiers, which we train using the same features/training code, with positive and negative examples set using whatbird.com attribute labels. We combined attribute classifiers into per-class probabilities p(c|x) using the method described in [23].
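As a rough illustration of how attribute classifier outputs can be folded into per-class probabilities in the spirit of [23], the sketch below assumes binary class-attribute signatures, calibrated attribute probabilities, and attribute priors strictly between 0 and 1; it is a simplified sketch rather than necessarily the exact formulation used in the experiments:

```python
import numpy as np

def class_probs_from_attributes(p_attr_given_x, class_signatures, attr_priors):
    """Combine attribute classifier outputs into p(c | x), in the style of [23].

    p_attr_given_x   : length-M array, calibrated p(a_m = 1 | x) per attribute
    class_signatures : (C, M) binary array, attribute value a_m^c of each class
    attr_priors      : length-M array, empirical prior p(a_m = 1), in (0, 1)
    """
    C, M = class_signatures.shape
    log_scores = np.zeros(C)
    for m in range(M):
        # log [ p(a_m = a_m^c | x) / p(a_m = a_m^c) ] for each class
        p1, prior1 = p_attr_given_x[m], attr_priors[m]
        on = np.log(p1 / prior1)
        off = np.log((1.0 - p1) / (1.0 - prior1))
        log_scores += np.where(class_signatures[:, m] == 1, on, off)
    scores = np.exp(log_scores - log_scores.max())   # stabilize before normalizing
    return scores / scores.sum()
```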

For estimating user response statistics on the Birds-200 dataset, we used α_guess = 64, α_prob = 16, α_def = 8, and α_c = 8 (see Section 3.2).

5 Experiments

In this section, we provide experimental results and analysis of the hybrid human-computer classification paradigm. Due to space limitations, our discussion focuses on the Birds dataset. We include results (see Fig. 9) from which the reader can verify that trends are similar on Birds-200 and AwA, and we include additional results on AwA in the supplementary material.

5.1 Measuring Performance

[Fig. 5, left plot: Percent Classified Correctly vs. Number of Binary Questions Asked, with curves for Deterministic Users, MTurk Users, and MTurk Users + Model. Right panel: a Rose-breasted Grosbeak with the responses "Q: Is the belly red? yes (Def.); Q: Is the breast black? yes (Def.); Q: Is the primary color red? yes (Def.)".]

Fig. 5. Different Models of User Responses: Left: Classification performance on Birds-200 (Method 1) without computer vision. Performance rises quickly (blue curve) if users respond deterministically according to whatbird.com attributes. MTurk users respond much differently, resulting in low performance (green curve). A learned model of MTurk responses is much more robust (red curve). Right: A test image where users answer several questions incorrectly and our learned model still classifies the image correctly.

We use two main methodologies for measuring performance, which correspond to two different possible user interfaces:

– Method 1: We ask the user exactly T questions, predict the class with highest probability, and measure the percent of the time that we are correct.
– Method 2: After asking each question, we present to the user a small gallery of images of the class with highest probability and assume that the user will stop the system when presented with the correct class. In this case, we measure the average number of questions asked per test image.

For the second method, we assume that people are perfect verifiers, i.e., they will stop the system if and only if they have been presented with the correct class. While this is not always possible in reality, there is some trade-off between classification accuracy and amount of human labor, and we believe that these two metrics collectively capture the most important considerations.
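Both protocols are straightforward to simulate on top of the interactive loop. The sketch below assumes a hypothetical helper that returns the top-ranked class after a given number of questions, and treats the user as a perfect verifier as described above:

```python
import numpy as np

def evaluate_method_1(top_class_after, test_set, T):
    """Method 1: ask exactly T questions, then check the top-ranked class.

    top_class_after : callable, top_class_after(example, t) -> top class after t questions
    test_set        : list of (example, true_class) pairs
    """
    correct = sum(top_class_after(x, T) == c for x, c in test_set)
    return correct / len(test_set)

def evaluate_method_2(top_class_after, test_set, max_questions=60):
    """Method 2: count questions until the presented class is correct,
    assuming the user perfectly verifies the gallery shown after each question."""
    counts = []
    for x, c in test_set:
        t = 0
        while top_class_after(x, t) != c and t < max_questions:
            t += 1
        counts.append(t)
    return float(np.mean(counts))
```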

5.2 Results

User Responses are Stochastic: In this section, we present our results and discuss some interesting trends toward understanding the visual 20 questions classification paradigm.

In Fig. 5, we show the effects of different models of user responses without using any computer vision. When users are assumed to respond deterministically in accordance with the attributes from whatbird.com, performance rises quickly to 100% within 8 questions (roughly log₂(200)). However, this assumption is not realistic; when testing with responses from Mechanical Turk, performance saturates at around 5%. The low performance caused by subjective answers is unavoidable (e.g., perception of the color brown vs. the color buff), and the probability of the correct class drops to zero after any inconsistent response.

[Fig. 6, left plot: Percent Classified Correctly vs. Number of Binary Questions Asked, with curves for No CV, 1-vs-all, and Attribute. Right plot: Percent of Test Set Images vs. Number of Binary Questions Asked, with legend No CV (10.64), 1-vs-all (5.84), Attribute (6.29).]

Fig. 6. Performance on Birds-200 when using computer vision: Left plot: comparison of classification accuracy (Method 1) with and without computer vision when using MTurk user responses. Two different computer vision algorithms are shown, one based on per-class 1-vs-all classifiers and another based on attribute classifiers. Right plot: the number of questions needed to identify the true class (Method 2) drops from 10.64 to 5.84 on average when incorporating computer vision.

Although performance is 10 times better than random chance, it renders the system useless. This demonstrates a challenge for existing field guide websites in helping lay-people identify bird species. When our learned model of user responses (see Section 3.2) is incorporated, performance jumps to 70% due to the ability to tolerate a reasonable degree of error in user responses (see Fig. 5 for an example). Nevertheless, stochastic user responses increase the number of questions required to achieve a given accuracy level, and some images can never be classified correctly, even when asking all possible questions. In Section 5.2, we discuss the reasons why performance saturates at lower than 100%.

Computer Vision Reduces Manual Labor: The main benefit of computer vision is a reduction in human labor (in terms of the number of questions a user has to answer). In Fig. 6, we see that computer vision reduces the average number of yes/no questions needed to identify the true bird species from 10.64 to 5.84 using responses from MTurk users. Without computer vision, the distribution of question counts is bell-shaped and centered around 7 questions. When computer vision is incorporated, the distribution peaks at 0 questions but is more heavy-tailed, which suggests that computer vision algorithms are often good at recognizing the "easy" test examples (examples that are sufficiently similar to the training data), but provide diminishing returns toward classifying the harder examples that are not sufficiently similar to the training data. As a result, computer vision is more effective at reducing the average amount of time necessary to classify an image than at reducing the time spent on the most difficult images.

User Responses Drive Up Performance: An alternative way of interpreting the results is that user responses drive up the accuracy of computer vision algorithms. In Fig. 6, we see that user responses improve overall performance from ≈ 27% (using 0 questions) to ≈ 72%.


[Fig. 7 shows example images involving a Western Grebe, a Rose-breasted Grosbeak, and a Yellow-headed Blackbird, comparing the first question asked with and without computer vision (e.g., "Is the throat white? yes (Def.)", "Is the crown black? yes (Def.)", "Is the shape perching-like? no (Def.)").]

Fig. 7. Examples where computer vision and user responses work together: Left: An image that is only classified correctly when computer vision is incorporated. Additionally, the computer vision based method selects the question HasThroatColorWhite, a different and more relevant question than when vision is not used. In the right image, the user response to HasCrownColorBlack helps correct computer vision when its initial prediction is wrong.

Computer Vision Improves Overall Performance: Even when users answer all questions, performance saturates at a higher level when using computer vision (≈ 72% vs. ≈ 67%, see Fig. 6). The left image in Fig. 7 shows an example of an image classified correctly using computer vision, which is not classified correctly without computer vision, even after asking 60 questions. In this example, some visually salient features like the long neck are not captured in our list of visual attribute questions. The features used by our vision algorithms also capture other cues (such as global texture statistics) that are not well-represented in our list of attributes (which capture mostly color and part-localized patterns).

Different Questions Are Asked With and Without Computer Vision: In general, the information gain criterion favors questions that 1) can be answered reliably, and 2) split the set of possible classes roughly in half. Questions like HasShapePerchingLike, which divides the classes fairly evenly, and HasUnderpartsColorYellow, which tends to be answered reliably, are commonly chosen. When computer vision is incorporated, the class likelihoods change and different questions are selected. In the left image of Fig. 7, we see an example where a different question is asked with and without computer vision, which allows the computer vision based method to hone in on the correct class using one question.

Recognition is Not Always Successful: According to the Cornell Ornithology Website [30], the four keys to bird species recognition are 1) size and shape, 2) color and pattern, 3) behavior, and 4) habitat. Bird species classification is a difficult problem and is not always possible using a single image. One potential advantage of the visual 20 questions paradigm is that other contextual sources of information such as behavior and habitat can easily be incorporated as additional questions.


[Fig. 8 shows example images labeled Parakeet Auklet, Least Auklet, Sayornis, and Gray Kingbird, with the user response "Q: Is the belly multi-colored? yes (Def.)".]

Fig. 8. Images that are misclassified by our system: Left: The Parakeet Auklet image is misclassified due to a cropped image, which causes an incorrect answer to the belly pattern question (the Parakeet Auklet has a plain, white belly, see Fig. 2). Right: The Sayornis and Gray Kingbird are commonly confused due to visual similarity.

Fig. 8 illustrates some example failures. The most common failure conditions occur due to 1) classes that are nearly visually identical, 2) images with a poor viewpoint or low resolution where some parts are not visible, 3) significant mistakes made by MTurkers, or 4) limitations in the particular set of attributes we selected.

1-vs-all vs. Attribute-Based Classification: In general, 1-vs-all classifiers slightly outperform attribute-based classifiers; however, they converge to similar performance as the number of questions increases, as shown in Figs. 6 and 9. The features we use (kernelized and based on bag-of-words) may not be well suited to the types of attributes we are using, which tend to be localized and associated with a particular part. One potential advantage of attribute-based methods is computational scalability when the number of classes increases; whereas 1-vs-all methods always require C classifiers, the number of attribute classifiers can be varied in order to trade off accuracy and computation time. The table below displays the average number of questions needed (Method 2) on the Birds dataset using different numbers of attribute classifiers (which were selected randomly):

Classifiers:     200 (1-vs-all)  288 attr.  100 attr.  50 attr.  20 attr.  10 attr.
Avg. questions:  5.84            6.29       6.49       7.37      8.76      9.40

6 Conclusion

Object recognition remains an incredibly challenging problem for computer vision. Furthermore, recognizing tightly related categories in one shot is difficult even for humans without the proper expertise. Our work attempts to leverage both the power of human recognition abilities and that of computer vision. We presented a simple way of designing a hybrid human-computer classification system, which can be used in conjunction with a large variety of computer vision algorithms.


[Fig. 9, left plot: Percent Classified Correctly vs. Number of Binary Questions Asked, with curves for No CV, 1-vs-all, and Attribute. Right plot: Percent of Test Set Images vs. Number of Binary Questions Asked, with legend No CV (5.95), 1-vs-all (4.31), Attribute (4.11).]

Fig. 9. Performance on Animals With Attributes: Left plot: Classification performance (Method 1), simulating user responses using soft class-attributes (see [23]). Right plot: The required number of questions needed to identify the true class (Method 2) drops from 5.95 to 4.11 on average when incorporating computer vision.

Our results show that user input significantly drives up performance; while it may take many years before object recognition algorithms achieve reasonable performance on their own, incorporating human input can produce usable recognition systems. On the other hand, having computer vision in the loop reduces the amount of human labor required to successfully classify an image. Finally, we showed that incorporating models of stochastic user responses leads to much better reliability in comparison to deterministic field guides generated by experts.

We believe our work opens the door to many interesting sub-problems. The most obvious next step is to explore other types of super-categories. While we were able to extract a set of reasonable attributes/questions for the bird dataset, this may be more difficult for other domains; one possible direction for future work is to find a more principled way of discovering a set of useful questions. Alternative types of user input, such as asking the user to click on the location of certain parts, could also be investigated. Lastly, while we used off-the-shelf computer vision algorithms in this work, it may be possible to improve them to better suit the challenges of tightly-related category recognition, for example with algorithms that incorporate a part-based model.

References

1. Griffin, G., Holub, A., Perona, P.: Caltech-256 object category dataset. Technical Report 7694, California Institute of Technology (2007)
2. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL VOC Challenge 2009 Results. (http://www.pascal-network.org/challenges/VOC/voc2009/workshop/index.html)
3. Nister, D., Stewenius, H.: Scalable recognition with a vocabulary tree. In: CVPR. (2006)
4. Nilsback, M., Zisserman, A.: Automated flower classification over a large number of classes. In: Indian Conf. on Comp. Vision, Graphics & Image Proc. (2008) 722–729
5. Lazebnik, S., Schmid, C., Ponce, J.: A maximum entropy framework for part-based texture and object recognition. In: ICCV. Volume 1. (2005) 832–838
6. Martínez-Muñoz, et al.: Dictionary-free categorization of very similar objects via stacked evidence trees. In: CVPR. (2009)
7. Belhumeur, P., Chen, D., Feiner, S., Jacobs, D., Kress, W., Ling, H., Lopez, I., Ramamoorthi, R., Sheorey, S., White, S., Zhang, L.: Searching the world's herbaria: A system for visual identification of plant species. In: ECCV. (2008) 116–129
8. Zhou, X., Huang, T.: Relevance feedback in image retrieval: A comprehensive review. Multimedia Systems 8 (2003) 536–544
9. Tao, D., Tang, X., Li, X., Wu, X.: Asymmetric bagging and random subspace for support vector machines-based relevance feedback in image retrieval. PAMI 28 (2006) 1088–1099
10. Tong, S., Koller, D.: Support vector machine active learning with applications to text classification. JMLR 2 (2002) 45–66
11. Kapoor, A., Grauman, K., Urtasun, R., Darrell, T.: Active learning with Gaussian processes for object categorization. In: ICCV. (2007) 1–8
12. Holub, A., Perona, P., Burl, M.: Entropy-based active learning for object recognition. In: Workshop on Online Learning for Classification (OLC). (2008) 1–8
13. Neapolitan, R.E.: Probabilistic reasoning in expert systems: theory and algorithms. John Wiley & Sons, Inc., New York, NY, USA (1990)
14. Beynon, M., Cosker, D., Marshall, D.: An expert system for multi-criteria decision making using Dempster Shafer theory. Expert Systems with Applications 20 (2001)
15. Tsang, S., Kao, B., Yip, K., Ho, W., Lee, S.: Decision trees for uncertain data. In: International Conference on Data Engineering (ICDE). (2009)
16. Elouedi, Z., Mellouli, K., Smets, P.: Belief decision trees: theoretical foundations. International Journal of Approximate Reasoning 28 (2001) 91–124
17. Jeon, J., Manmatha, R.: Using maximum entropy for automatic image annotation. In: International Conference on Image and Video Retrieval (CIVR). (2004)
18. Quinlan, J.R.: C4.5: programs for machine learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1993)
19. Sivic, J., Russell, B., Zisserman, A., Freeman, W., Efros, A.: Unsupervised discovery of visual object class hierarchies. In: CVPR. (2008) 1–8
20. Griffin, G., Perona, P.: Learning and using taxonomies for fast visual categorization. In: CVPR. (2008) 1–8
21. Torralba, A., Murphy, K., Freeman, W.: Sharing features: efficient boosting procedures for multiclass object detection. In: CVPR. Volume 2. (2004)
22. Dietterich, T., Bakiri, G.: Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research 2 (1995) 263–286
23. Lampert, C., Nickisch, H., Harmeling, S.: Learning to detect unseen object classes by between-class attribute transfer. In: CVPR. (2009)
24. Farhadi, A., Endres, I., Hoiem, D., Forsyth, D.: Describing objects by their attributes. In: CVPR. (2009)
25. Kumar, N., Berg, A.C., Belhumeur, P.N., Nayar, S.K.: Attribute and simile classifiers for face verification. In: ICCV. (2009)
26. Von Ahn, L., Maurer, B., McMillen, C., Abraham, D., Blum, M.: reCAPTCHA: Human-based character recognition via web security measures. Science 321 (2008)
27. Waite, M.: whatbird.com. (http://www.whatbird.com/)
28. Vedaldi, A., Gulshan, V., Varma, M., Zisserman, A.: Multiple kernels for object detection. In: ICCV. (2009)
29. Platt, J.: Probabilities for SV machines. In: NIPS. (1999) 61–74
30. Cornell Lab of Ornithology: www.allaboutbirds.org. (http://www.allaboutbirds.org/NetCommunity/page.aspx?pid=1053)