Visual Recognition with Humans in the Loop

Anonymous ECCV submission

Paper ID ***
Abstract. We present an interactive, hybrid human-computer method for object classification. The method applies to classes of problems that are difficult for most people, but are recognizable by people with the appropriate expertise (e.g., animal species or airplane model recognition). The classification method can be seen as a visual version of the 20 questions game, where questions based on simple visual attributes are posed interactively. The goal is to identify the true class while minimizing the number of questions asked, using the visual content of the image. Incorporating user input drives up recognition accuracy to levels that are good enough for practical applications; at the same time, computer vision reduces the amount of human interaction required. The resulting hybrid system is able to handle difficult, large multi-class problems with tightly-related categories.

We introduce a general framework for incorporating almost any off-the-shelf multi-class object recognition algorithm into the visual 20 questions game, and provide methodologies to account for imperfect user responses and unreliable computer vision algorithms. We evaluate the accuracy and computational properties of different computer vision algorithms and the effects of noisy user responses on a dataset of 200 bird species and on the Animals With Attributes dataset. Our results demonstrate the effectiveness and practicality of the hybrid human-computer classification paradigm.
1 Introduction
Multi-class object recognition is a widely studied field in computer vision that has undergone rapid change and progress over the last decade. These advances have largely focused on types of object categories that are easy for humans to recognize, such as motorbikes, chairs, horses, bottles, etc. Finer-grained categories, such as specific types of motorbikes, chairs, or horses, are more difficult for humans and have received comparatively little attention. One could argue that object recognition as a field is simply not mature enough to tackle these types of finer-grained categories. Performance on basic-level categories is still lower than what people would consider acceptable for practical applications (state-of-the-art accuracy on Caltech-256 [1] is ≈ 45%, and the winner of the 2009 VOC detection challenge [2] achieved only ≈ 28% average precision). Moreover, the number of object categories in most object recognition datasets is still fairly low, and increasing the number of categories further is usually detrimental to performance [1].
Fig. 1. Examples of classification problems that are easy or hard for humans. Panels: (A) easy for humans (Chair? Airplane?), (B) hard for humans (Finch? Bunting?), (C) easy for humans (Yellow belly? Blue belly?). While basic-level category recognition (left) and recognition of low-level visual attributes (right) are easy for humans, most people struggle with finer-grained categories (middle). By defining categories in terms of low-level visual properties, hard classification problems can be turned into a sequence of easy ones.
On the other hand, recognition of finer-grained categories is an important problem to study – it can help people recognize types of objects they don't yet know how to identify. We believe a hybrid human-computer recognition method is a practical intermediate solution toward applying contemporary computer vision algorithms to these types of problems. Rather than trying to solve object recognition entirely, we take on the objective of minimizing the amount of human labor required. As research in object recognition progresses, tasks will become increasingly automated, until eventually we will no longer need humans in the loop. This approach differs from some of the prevailing ways in which people approach research in computer vision, where researchers begin with simpler and less realistic datasets and progressively make them more difficult and realistic as computer vision improves (e.g., Caltech-4 → Caltech-101 → Caltech-256). The advantage of the human-computer paradigm is that we can provide usable services to people in the interim period where computer vision is still unsolved. This may help increase demand for computer vision, spur data collection, and provide solutions for the types of problems people outside the field want solved.
In this work, our goal is to provide a simple framework that makes it as effortless as possible for researchers to plug their existing algorithms into the human-computer framework and use humans to drive up performance to levels that are good enough for real-life applications. Implicit in our model is the assumption that lay-people generally cannot recognize finer-grained categories (e.g., Myrtle Warbler, Thruxton Jackaroo, etc.) due to imperfect memory or limited experience; however, they do have the fundamental visual capabilities to recognize the parts and attributes that collectively make recognition possible (see Fig. 1). By contrast, computers lack many of the fundamental visual capabilities that humans have, but have perfect memory and are able to pool knowledge collected from large groups of people. Users interact with our system by answering simple yes/no or multiple choice questions about an image or object, as shown in Fig. 2. Similar to the 20 questions game¹, we observe that the number of questions needed to classify an object from a database of C classes is usually O(log C) (when user responses are accurate), and can be faster when computer vision is in the loop.
¹ See, for example, http://20q.net.
Fig. 2. Examples of the visual 20 questions game on the 200 class Bird dataset (panels labeled "Computer vision is helpful" and "Computer vision is not helpful"). Human responses (shown in red) to questions posed by the computer (shown in blue) are used to drive up recognition accuracy. In the left image, computer vision algorithms can guess the bird species (Black-footed Albatross) correctly without any user interaction. In the middle image, computer vision reduces the number of questions to 2 (Parakeet Auklet). In the right image, computer vision provides little help (Henslow's Sparrow).
Our method of choosing the next question to ask uses an information gain criterion and can deal with noisy (probabilistic) user responses. We show that it is easy to incorporate any computer vision algorithm that can be made to produce a probabilistic output over object classes.
Our experiments in this paper focus on bird species categorization, which we take to be a representative example of recognition of tightly-related categories. The bird dataset contains 200 bird species and over 6,000 images. We believe that the same types of methodologies used for birds will apply to other object domains.
The structure of the paper is as follows: In Section 2, we discuss related work. In Section 3, we define the hybrid human-computer problem and the basic algorithm, which includes methodologies for modeling noisy user responses and incorporating computer vision into the framework. We describe our datasets and implementation details in Section 4, and present empirical results in Section 5.
2 Related Work
Recognition of tightly related categories is still an open area in computer vision, although there has been success in a few areas such as book covers and movie posters (i.e., rigid, mostly flat objects [3]). The problem is challenging because the number of object categories is larger, with low interclass variance, while variability in pose, lighting, and background causes high intraclass variance. The ability to exploit domain knowledge and cross-category patterns and similarities becomes increasingly important.
There exist a variety of datasets related to recognition of tightly-related categories, including Oxford Flowers 102 [4], Oxford Birds [5], and STONEFLY9 [6]. While these works represent progress, they still have shortcomings in scaling to large numbers of categories, applying to other types of object domains, or achieving performance levels that are good enough for real-world applications.
Perhaps most similar in spirit to our work is the Botanist's Field Guide [7], a system for plant species recognition with hundreds of categories and tens of thousands of images. One key difference is that their system is intended primarily for experts, and requires plant leaves to be photographed in a controlled manner at training and test time, making segmentation and pose normalization possible. In contrast, all of our training and testing images are obtained from Flickr in unconstrained settings (see Fig. 3), and the system is intended to be used by lay-people.
A multitude of areas in computer science interleave vision, learning, or other processing with human input. Relevance feedback [8, 9] is a method for interactive image retrieval, in which users mark the relevancy of image search results, which are in turn used to create a refined search query. Active learning algorithms [10, 11, 12] interleave training a classifier with asking users to label examples, where the objective is to minimize the total number of labeling tasks. Our objectives are somewhat similar, except that we query information at runtime rather than at training time. Perhaps the most similar area to our application is expert systems [13, 14], which are used for applications such as medical diagnosis, accounting, process control, and software troubleshooting. Expert systems attempt to answer a problem that could normally be solved only by one or more experts, and involve construction of a knowledge base and inference rules that can be used to synthesize a set of steps or input queries dynamically. Our method can be seen as an application of expert systems to object recognition, with the key addition that we are able to use the observed image pixels as an additional source of information. Computationally, our method also has similarities to algorithms based on information gain, entropy calculation, and decision trees [15, 16, 17, 18].
Finally, a lot of progress has been made on trying to scale object recognition to large numbers of categories. Such approaches include using class taxonomies [19, 20], feature sharing [21], error-correcting output codes (ECOC) [22], and attribute-based classification methods [23, 24, 25]. All of these methods could easily be plugged into our framework to incorporate user interaction.
Fig. 3. Examples of user responses for each of the 25 attributes. The distribution over {Guessing, Probably, Definitely} is color coded, with blue denoting 0% and red denoting 100% of the five answers per image-attribute pair. Example classes (columns): Ivory Gull, Bank Swallow, Indigo Bunting, Whip-poor-will, Chuck-will's-widow. Attributes (rows): back color, back pattern, belly color, belly pattern, bill shape, breast color, breast pattern, crown color, eye color, forehead color, head pattern, leg color, nape color, primary color, shape, size, tail pattern, throat color, under tail color, underparts color, upper tail color, upperparts color, wing color, wing pattern, wing shape.
3 Visual Recognition With Humans in the Loop
Given an image x, our goal is to determine the true object class c ∈ {1, ..., C} by posing questions based on visual properties that are easy for the user to answer (see Fig. 1). At each step, we aim to exploit the visual content of the image and the current history of question responses to intelligently select the next question. The basic algorithm flow is summarized in Fig. 4.

Let Q = {q_1, ..., q_n} be a set of random variables corresponding to all possible questions (e.g., IsRed?, HasStripes?, etc.), and let A be the set of possible answers², such that q_i ∈ A. We can also ask users to select a confidence value for each question; let r_i be a random variable corresponding to the confidence reported for question i, and let V be the set of possible confidence scores (e.g., Guessing, Probably, Definitely), such that r_i ∈ V. For convenience we define the variables u_i = (q_i, r_i); we will refer to these as "questions" from now on, and it should be clear from context when we mean question/confidence pairs.

Let j ∈ {1, ..., n}^T be an array of T indices of questions to ask the user. U^{t−1} = {u_{j(1)}, ..., u_{j(t−1)}} is the set of questions asked by time step t − 1. At time step t, we would like to find the question j(t) that maximizes the expected information gain. Information gain is widely used in decision trees (e.g., [18]) and can be computed from an estimate of p(c | x, U).

We define I(c; u | x, U), the expected information gain of posing the additional question u, as follows:
I(c; u | x, U) = E_u[ KL( p(c | x, u ∪ U) ‖ p(c | x, U) ) ]    (1)

               = H(c | x, U) − ∑_{u′ ∈ A×V} p(u′ | x, U) H(c | x, u′ ∪ U)    (2)
where H(c | x, U) is the entropy of p(c | x, U):

H(c | x, U) = − ∑_{c=1}^{C} p(c | x, U) log p(c | x, U)    (3)
The general algorithm for interactive object recognition is shown in Algorithm 1. In the next sections, we describe in greater detail methods for modeling user responses and different methods for incorporating computer vision algorithms, which correspond to different ways to estimate p(c | x, U).
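To make the selection criterion concrete, the following is a minimal NumPy sketch of Eqs. (1)–(3). It is not the authors' implementation: `posterior` and `response_prob` are hypothetical hooks standing in for the estimators described in Sections 3.1 and 3.2, and the answer and confidence sets mirror A and V above.

```python
import itertools
import numpy as np

ANSWERS = ["yes", "no"]                               # the answer set A
CONFIDENCES = ["guessing", "probably", "definitely"]  # the confidence set V

def entropy(p):
    """H(c | x, U) from Eq. (3); p is a length-C class posterior."""
    p = p[p > 0]  # zero-probability classes contribute nothing
    return -np.sum(p * np.log2(p))

def expected_information_gain(posterior, response_prob, x, U, q):
    """I(c; u | x, U) from Eqs. (1)-(2) for a candidate question q.

    posterior(x, U)        -> length-C array p(c | x, U)
    response_prob(x, U, u) -> scalar p(u | x, U) for u = (q, answer, conf)
    Both hooks are illustrative, not part of the paper's code.
    """
    h_before = entropy(posterior(x, U))
    gain = 0.0
    for u in itertools.product([q], ANSWERS, CONFIDENCES):
        p_u = response_prob(x, U, u)
        if p_u > 0:
            gain += p_u * (h_before - entropy(posterior(x, U + [u])))
    return gain
```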
3.1 Incorporating Computer Vision
When no computer vision is involved, it is possible to pre-compute a decision tree that defines which question to ask for every possible sequence of question answers. With computer vision in the loop, however, the best questions depend dynamically on the contents of the image.
² We model user answers as binary questions; extensions to multiple choice questions are readily available and may be desirable in a future version of the system.
Algorithm 1 Visual 20 Questions Game
1: U⁰ ← ∅
2: for t = 1 to 20 do
3:     j(t) = argmax_k I(c; u_k | x, U^{t−1})
4:     Ask the user question u_{j(t)}, and set U^t ← U^{t−1} ∪ {u_{j(t)}}
5: end for
6: return class c* = argmax_c p(c | x, U^t)
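Algorithm 1 translates directly into code. The sketch below reuses the hypothetical `expected_information_gain` helper from the previous sketch and additionally assumes an `ask_user` function that presents a question to the user and returns the observed (question, answer, confidence) tuple.

```python
import numpy as np

def visual_20_questions(x, questions, posterior, response_prob,
                        ask_user, T=20):
    """A sketch of Algorithm 1; all hooks are illustrative."""
    U = []                       # U^0 <- empty set (line 1)
    remaining = list(questions)
    for _ in range(T):           # line 2
        # Line 3: pick the question with maximal expected information gain.
        q_best = max(remaining,
                     key=lambda q: expected_information_gain(
                         posterior, response_prob, x, U, q))
        # Line 4: pose the question and record the response u = (q, a, r).
        U.append(ask_user(q_best))
        remaining.remove(q_best)
    # Line 6: return the MAP class estimate under p(c | x, U^T).
    return int(np.argmax(posterior(x, U)))
```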
Fig. 4. Visualization of the basic algorithm flow. Given an input image x, the system poses questions to the user (e.g., Question 1: "Is the belly black?" A: NO; Question 2: "Is the bill hooked?" A: YES), which along with computer vision incrementally refine the probability distribution over classes.
In this section, we propose a simple framework for incorporating any multi-class object recognition algorithm that produces a probabilistic output over classes. We can compute the posterior as follows:

p(c | x, U) ∝ p(U | c, x) p(c | x) = p(U | c) p(c | x)    (4)
Here we make the assumption that p(U | c, x) = p(U | c); effectively, this assumes that the types of noise or randomness we see in user responses are class-dependent but not image-dependent. We can still accommodate variation in user responses due to user error, subjectivity, external factors, and intraclass variance; however, we throw away some image-related information (for example, we lose the ability to model a change in the distribution of user responses as a result of a computer-vision-based estimate of object pose).
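In code, the posterior update of Eq. (4) is a running product over responses, conveniently kept in log space. This is a minimal sketch under the paper's independence assumptions; `log_p_u_given_c` is a hypothetical hook for the user-response model of Section 3.2.

```python
import numpy as np

def class_posterior(p_c_given_x, log_p_u_given_c, U):
    """p(c | x, U) via Eq. (4): proportional to p(U | c) * p(c | x).

    p_c_given_x: length-C array from the offline classifier (or a prior
    p(c) when no computer vision is used). log_p_u_given_c(u) returns a
    length-C array of log p(u | c) for one response u.
    """
    log_post = np.log(np.clip(p_c_given_x, 1e-12, None))
    for u in U:                      # responses independent given the class
        log_post += log_p_u_given_c(u)
    log_post -= log_post.max()       # stabilize before exponentiating
    post = np.exp(log_post)
    return post / post.sum()         # normalize away the proportionality
```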
In terms of computation, we estimate p(c | x) using a classifier trained offline (more details in Section 4.3). Upon receiving an image, we run the classifier once at the beginning of the process, and incrementally update p(c | x, U) by gathering more answers to questions from the user. One could imagine a system where computer vision is invoked several times during the process; as categories are weeded out by answers, the system would use a more finely tuned classifier to update the estimate of p(c | x). However, our preliminary experiments with such methods did not show an advantage³. Note that when no computer vision is involved, we simply replace p(c | x) with a prior p(c).

³ See the supplementary material for more details on this.
3.2 Modeling User Responses
Recall that for each question we may also ask for a corresponding confidence value from the user, which may be necessary when an attribute cannot be determined (for example, when the associated part(s) are not visible). We estimate the distribution p(U | c) as follows:

p(U | c) = ∏_{i=1}^{t} p(u_i | c)    (5)
In the above we assume that the questions are answered independently given the category. If a single user answers all of the questions, this assumption might not hold (responses are correlated due to per-user subjectivity); however, in other variants of our application we may want to crowdsource the questions in order to label many images (similar to reCAPTCHA [26]), in which case answers would come from different people. It may also be possible to use a more sophisticated model that estimates a full joint distribution for p(U | c); in our preliminary experiments this approach did not work well due to insufficient training data.
To compute p(u_i | c) = p(q_i, r_i | c) = p(q_i | r_i, c) p(r_i | c), we assume that p(r_i | c) is uniform. Next, we compute each p(q_i | r_i, c) as the posterior of a multinomial distribution with Dirichlet prior Dir(α_r p(q_i | r_i) + α_c p(q_i | c)), where α_r and α_c are constants, p(q_i | r_i) is a global attribute prior, and p(q_i | c) is estimated by pooling together certainty labels. Incorporating prior terms is important in order to avoid over-fitting when the training examples for any attribute-class pair are sparse. In practice, we use a larger prior term for Guessing than for Definitely, α_guess > α_def, which effectively down-weights the importance of any response with certainty level Guessing.
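The smoothed estimate has a closed form: the posterior mean of a multinomial with a Dirichlet prior is additive smoothing of the observed counts. Below is a sketch under our reading of the text; the function and array names are illustrative, and the α values are those reported in Section 4.3.

```python
import numpy as np

ALPHA_R = {"guessing": 64.0, "probably": 16.0, "definitely": 8.0}  # alpha_r
ALPHA_C = 8.0                                                      # alpha_c

def estimate_q_given_rc(counts, prior_r, prior_c, r):
    """Posterior mean of p(q_i | r_i = r, c) under the prior
    Dir(alpha_r * p(q_i | r_i) + alpha_c * p(q_i | c)).

    counts:  observed answer counts for this (confidence, class) pair
    prior_r: global attribute prior p(q_i | r_i), same length as counts
    prior_c: class-conditional prior p(q_i | c), pooled over certainty labels
    """
    smoothed = (np.asarray(counts, dtype=float)
                + ALPHA_R[r] * np.asarray(prior_r)
                + ALPHA_C * np.asarray(prior_c))
    return smoothed / smoothed.sum()
```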
4 Datasets and Implementation Details
In this section we provide a brief overview of the datasets we used, the methods used to construct visual questions, the computer vision algorithms we tested, and parameter settings.
4.1 Birds-200 Dataset
Birds-200 is a dataset of 6,033 images over 200 bird species, such as Myrtle Warblers, Pomarine Jaegers, and Black-footed Albatrosses – classes which usually cannot be identified by non-experts. In many cases, different bird species are nearly visually identical (see Fig. 8).

We assembled a set of 25 visual questions (listed in Fig. 3), which encompass 288 binary attributes (e.g., the question HasBellyColor can take on 15 different possible colors). The list of attributes was extracted from whatbird.com [27], a bird field guide website.

We collected "deterministic" class-attributes by parsing attributes from whatbird.com. Additionally, we collected data on how non-expert users respond to attribute questions via a Mechanical Turk interface. To minimize the effects of user subjectivity and error, our interface provides prototypical images of each possible attribute response.
Screenshots of the question-answering user interface are included in the supplementary material.
Fig. 3 shows a visualization of the types of user responses we obtain on the Birds-200 dataset. It should be noted that the uncertainty of the user responses correlates strongly with the parts that are visible in an image, as well as with the overall difficulty of the corresponding bird species.
When evaluating performance, test results are generated by randomly selecting a response returned by an MTurk user for the appropriate test image.
4.2 Animals With Attributes
We also tested performance on Animals With Attributes (AwA) [23], a dataset of 50 animal classes and 85 binary attributes. We consider this dataset less relevant than birds (because the classes are not tightly related), and therefore focus less on it.
4.3 Implementation Details and Parameter Settings
For both datasets, our computer vision algorithms are based on Andrea Vedaldi's publicly available source code [28], which combines vector-quantized geometric blur and color/gray SIFT features using spatial pyramids, multiple kernel learning, and per-class 1-vs-all SVMs. We added additional features based on full-image color histograms and vector-quantized color histograms. For each classifier we used Platt scaling [29] to learn parameters for p(c | x) on a validation set. We used 15 training examples per Birds-200 class and 30 training examples per AwA class. Bird training and testing images are roughly cropped.
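Platt scaling fits a sigmoid p(y = 1 | f) = 1/(1 + exp(Af + B)) to each SVM's raw decision values on held-out data. The sketch below approximates this with a logistic regression on the scores (a common stand-in for Platt's original procedure); the helper names and the final normalization across classes are our illustration, not the paper's code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def platt_calibrate(decision_values, labels):
    """Fit a sigmoid mapping one classifier's decision values to
    probabilities, using a held-out validation set."""
    lr = LogisticRegression()
    lr.fit(np.asarray(decision_values).reshape(-1, 1), labels)
    return lambda f: lr.predict_proba(np.asarray(f).reshape(-1, 1))[:, 1]

# The per-class calibrated outputs can then be normalized to obtain the
# p(c | x) used in Eq. (4), e.g.:
#   p = np.array([calibrators[c](scores[c])[0] for c in range(C)])
#   p /= p.sum()
```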
Additionally, we compare performance to a second computer vision algorithm based on attribute classifiers, which we train using the same features and training code, with positive and negative examples set using whatbird.com attribute labels. We combined attribute classifiers into per-class probabilities p(c | x) using the method described in [23].
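For reference, the combination rule of [23] (direct attribute prediction) scores each class by how well the predicted attribute probabilities match that class's attribute signature. A sketch of that rule as we understand it, with illustrative names:

```python
import numpy as np

def attribute_class_probs(p_attr, class_attr, attr_prior, eps=1e-12):
    """p(c | x) from attribute classifiers, in the spirit of [23].

    p_attr:     length-M array of p(a_m = 1 | x) from attribute classifiers
    class_attr: C x M binary matrix of class-attribute associations
    attr_prior: length-M array of attribute priors p(a_m = 1)
    Returns p(c | x) proportional to prod_m p(a_m = a_m^c | x) / p(a_m^c).
    """
    match = np.where(class_attr == 1, p_attr, 1.0 - p_attr)
    prior = np.where(class_attr == 1, attr_prior, 1.0 - attr_prior)
    log_p = np.sum(np.log(match + eps) - np.log(prior + eps), axis=1)
    log_p -= log_p.max()
    p = np.exp(log_p)
    return p / p.sum()
```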
For estimating user response statistics on the Birds-200 dataset, we used α_guess = 64, α_prob = 16, α_def = 8, and α_c = 8 (see Section 3.2).
5 Experiments
In this section, we provide experimental results and analysis of the hybrid human-computer classification paradigm. Due to space limitations, our discussion focuses on the Birds dataset. We include results (see Fig. 9) from which the reader can verify that trends are similar on Birds-200 and AwA, and we include additional results on AwA in the supplementary material.
5.1 Measuring Performance
We use two main methodologies for measuring performance, which correspond to two different possible user interfaces:
Fig. 5. Different models of user responses. Left: Classification performance on Birds-200 (Method 1, percent classified correctly vs. number of binary questions asked) without computer vision. Performance rises quickly (blue curve) if users respond deterministically according to whatbird.com attributes. MTurk users respond much differently, resulting in low performance (green curve). A learned model of MTurk responses is much more robust (red curve). Right: A test image of a Rose-breasted Grosbeak, with responses "Is the belly red? yes (Def.)", "Is the breast black? yes (Def.)", "Is the primary color red? yes (Def.)", where users answer several questions incorrectly and our learned model still classifies the image correctly.
– Method 1: We ask the user exactly T questions, predict the class with the highest probability, and measure the percentage of the time that we are correct.
– Method 2: After asking each question, we present to the user a small gallery of images of the class with the highest probability, and assume that the user will stop the system when presented with the correct class. In this case, we measure the average number of questions asked per test image.
For the second method, we assume that people are perfect verifiers, i.e., they will stop the system if and only if they have been presented with the correct class. While this is not always possible in reality, there is some trade-off between classification accuracy and the amount of human labor, and we believe that these two metrics collectively capture the most important considerations.
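The two protocols are easy to state in code. The sketch below assumes a hypothetical session object exposing `posterior_after(t)` (the class posterior after t answered questions) and `true_class`; neither hook comes from the paper.

```python
import numpy as np

def method_1_accuracy(sessions, T):
    """Method 1: ask exactly T questions, then score the MAP prediction."""
    hits = [np.argmax(s.posterior_after(T)) == s.true_class
            for s in sessions]
    return float(np.mean(hits))

def method_2_avg_questions(sessions, max_T=60):
    """Method 2: a perfect verifier stops as soon as the true class is
    ranked first; report the average number of questions asked."""
    counts = []
    for s in sessions:
        t = next((t for t in range(max_T + 1)
                  if np.argmax(s.posterior_after(t)) == s.true_class),
                 max_T)
        counts.append(t)
    return float(np.mean(counts))
```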
5.2 Results
In this section, we present our results and discuss some interesting trends toward understanding the visual 20 questions classification paradigm.

User Responses are Stochastic: In Fig. 5, we show the effects of different models of user responses without using any computer vision. When users are assumed to respond deterministically in accordance with the attributes from whatbird.com, performance rises quickly to 100% within 8 questions (roughly log₂(200)). However, this assumption is not realistic; when testing with responses from Mechanical Turk, performance saturates at around 5%. Misclassifications caused by subjective answers are unavoidable (e.g., perception of the color brown vs. the color buff), and the probability of the correct class drops to zero after any inconsistent response. Although performance is 10 times better than random chance, it renders the system useless.
Fig. 6. Performance on Birds-200 when using computer vision. Left plot: comparison of classification accuracy (Method 1, percent classified correctly vs. number of binary questions asked) with and without computer vision when using MTurk user responses. Two different computer vision algorithms are shown, one based on per-class 1-vs-all classifiers and another based on attribute classifiers. Right plot: distribution over the test set of the number of questions needed to identify the true class (Method 2); the average drops from 10.64 (No CV) to 5.84 (1-vs-all; 6.29 for attribute classifiers) when incorporating computer vision.
This demonstrates a challenge for existing field guide websites in helping lay-people identify bird species. When our learned model of user responses (see Section 3.2) is incorporated, performance jumps to 70% due to the ability to tolerate a reasonable degree of error in user responses (see Fig. 5 for an example). Nevertheless, stochastic user responses increase the number of questions required to achieve a given accuracy level, and some images can never be classified correctly, even when asking all possible questions. Later in this section, we discuss the reasons why performance saturates below 100%.
Computer Vision Reduces Manual Labor: The main benefit of computer vision is a reduction in human labor (in terms of the number of questions a user has to answer). In Fig. 6, we see that computer vision reduces the average number of yes/no questions needed to identify the true bird species from 10.64 to 5.84 using responses from MTurk users. Without computer vision, the distribution of question counts is bell-shaped and centered around 7 questions. When computer vision is incorporated, the distribution peaks at 0 questions but is more heavy-tailed, which suggests that computer vision algorithms are often good at recognizing the "easy" test examples (examples that are sufficiently similar to the training data), but provide diminishing returns toward classifying the harder examples that are not sufficiently similar to the training data. As a result, computer vision is more effective at reducing the average amount of time necessary to classify an image than at reducing the time spent on the most difficult images.
User Responses Drive Up Performance: An alternative way of interpreting the results is that user responses drive up the accuracy of computer vision algorithms. In Fig. 6, we see that user responses improve overall performance from ≈ 27% (using 0 questions) to ≈ 72%.
Fig. 7. Examples where computer vision and user responses work together (example classes: Western Grebe, Rose-breasted Grosbeak, Yellow-headed Blackbird). Left: An image that is only classified correctly when computer vision is incorporated. Additionally, the computer-vision-based method selects the question HasThroatColorWhite ("Is the throat white? yes (Def.)"), a different and more relevant question than the one chosen when vision is not used ("Is the shape perching-like? no (Def.)"). In the right image, the user response to HasCrownColorBlack ("Is the crown black? yes (Def.)") helps correct computer vision when its initial prediction is wrong.
Computer Vision Improves Overall Performance: Even when users answer all questions, performance saturates at a higher level when using computer vision (≈ 72% vs. ≈ 67%, see Fig. 6). The left image in Fig. 7 shows an example of an image that is classified correctly using computer vision but is not classified correctly without it, even after asking 60 questions. In this example, some visually salient features like the long neck are not captured in our list of visual attribute questions. The features used by our vision algorithms also capture other cues (such as global texture statistics) that are not well represented in our list of attributes (which capture mostly color and part-localized patterns).
Different Questions Are Asked With and Without Computer Vision: In general, the information gain criterion favors questions that 1) can be answered reliably, and 2) split the set of possible classes roughly in half. Questions like HasShapePerchingLike, which divides the classes fairly evenly, and HasUnderpartsColorYellow, which tends to be answered reliably, are commonly chosen. When computer vision is incorporated, the likelihoods of the classes change and different questions are selected. In the left image of Fig. 7, we see an example where a different question is asked with and without computer vision, which allows the computer-vision-based method to hone in on the correct class using one question.
Recognition is Not Always Successful: According to the Cornell Ornithology website [30], the four keys to bird species recognition are 1) size and shape, 2) color and pattern, 3) behavior, and 4) habitat. Bird species classification is a difficult problem and is not always possible using a single image. One potential advantage of the visual 20 questions paradigm is that other contextual sources of information, such as behavior and habitat, can easily be incorporated as additional questions.
Fig. 8. Images that are misclassified by our system (example classes: Least Auklet, Parakeet Auklet, Sayornis, Gray Kingbird). Left: The Parakeet Auklet image is misclassified due to a cropped image, which causes an incorrect answer to the belly pattern question ("Is the belly multi-colored? yes (Def.)"; the Parakeet Auklet has a plain, white belly, see Fig. 2). Right: The Sayornis and Gray Kingbird are commonly confused due to visual similarity.
Fig. 8 illustrates some example failures. The most common failure conditions occur due to 1) classes that are nearly visually identical, 2) images with a poor viewpoint or low resolution where some parts are not visible, 3) significant mistakes made by MTurkers, or 4) limitations in the particular set of attributes we selected.
1-vs-all vs. Attribute-Based Classification: In general, 1-vs-all classifiers slightly outperform attribute-based classifiers; however, they converge to similar performance as the number of questions increases, as shown in Figs. 6 and 9. The features we use (kernelized and based on bag-of-words) may not be well suited to the types of attributes we are using, which tend to be localized and associated with a particular part. One potential advantage of attribute-based methods is computational scalability as the number of classes increases; whereas 1-vs-all methods always require C classifiers, the number of attribute classifiers can be varied in order to trade off accuracy and computation time. The table below displays the average number of questions needed (Method 2) on the Birds dataset using different numbers of attribute classifiers (which were selected randomly):
Classifiers      200 (1-vs-all)  288 attr.  100 attr.  50 attr.  20 attr.  10 attr.
Avg. questions   5.84            6.29       6.49       7.37      8.76      9.40
6 Conclusion
Object recognition remains an incredibly challenging problem for computer vision. Furthermore, recognizing tightly related categories in one shot is difficult even for humans without the proper expertise. Our work attempts to leverage both human recognition abilities and the power of computer vision. We presented a simple way of designing a hybrid human-computer classification system, which can be used in conjunction with a large variety of computer vision algorithms. Our results show that user input significantly drives up performance; while it may take many years before object recognition algorithms achieve reasonable performance on their own, incorporating human input can produce usable recognition systems.
Fig. 9. Performance on Animals With Attributes. Left plot: Classification performance (Method 1, percent classified correctly vs. number of binary questions asked), simulating user responses using soft class-attributes (see [23]); curves for No CV, 1-vs-all, and Attribute. Right plot: distribution of the number of questions needed to identify the true class (Method 2); the average drops from 5.95 (No CV) to 4.11 (attribute classifiers; 4.31 for 1-vs-all) when incorporating computer vision.
On the other hand, having computer vision in the loop reduces the amount of human labor required to successfully classify an image. Finally, we showed that incorporating models of stochastic user responses leads to much better reliability in comparison to deterministic field guides generated by experts.
We believe our work opens the door to many interesting sub-problems. The most obvious next step is to explore other types of super-categories. While we were able to extract a set of reasonable attributes/questions for the bird dataset, this may be more difficult for other domains; one possible direction for future work is to find a more principled way of discovering a set of useful questions. Alternative types of user input, such as asking the user to click on the location of certain parts, could also be investigated. Lastly, while we used off-the-shelf computer vision algorithms in this work, it may be possible to improve them to better suit the challenges of tightly-related category recognition, for example with algorithms that incorporate a part-based model.
References
1. Griffin, G., Holub, A., Perona, P.: Caltech-256 object category dataset. Technical Report 7694, California Institute of Technology (2007)
2. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL VOC Challenge 2009 Results. (http://www.pascal-network.org/challenges/VOC/voc2009/workshop/index.html)
3. Nister, D., Stewenius, H.: Scalable recognition with a vocabulary tree. In: CVPR. (2006)
4. Nilsback, M., Zisserman, A.: Automated flower classification over a large number of classes. In: Indian Conf. on Comp. Vision, Graphics & Image Proc. (2008) 722–729
5. Lazebnik, S., Schmid, C., Ponce, J.: A maximum entropy framework for part-based texture and object recognition. In: ICCV. Volume 1. (2005) 832–838
6. Martínez-Muñoz et al.: Dictionary-free categorization of very similar objects via stacked evidence trees. In: CVPR. (2009)
7. Belhumeur, P., Chen, D., Feiner, S., Jacobs, D., Kress, W., Ling, H., Lopez, I., Ramamoorthi, R., Sheorey, S., White, S., Zhang, L.: Searching the world's herbaria: A system for visual identification of plant species. In: ECCV. (2008) 116–129
8. Zhou, X., Huang, T.: Relevance feedback in image retrieval: A comprehensive review. Multimedia Systems 8 (2003) 536–544
9. Tao, D., Tang, X., Li, X., Wu, X.: Asymmetric bagging and random subspace for support vector machines-based relevance feedback in image retrieval. PAMI 28 (2006) 1088–1099
10. Tong, S., Koller, D.: Support vector machine active learning with applications to text classification. JMLR 2 (2002) 45–66
11. Kapoor, A., Grauman, K., Urtasun, R., Darrell, T.: Active learning with Gaussian processes for object categorization. In: ICCV. (2007) 1–8
12. Holub, A., Perona, P., Burl, M.: Entropy-based active learning for object recognition. In: Workshop on Online Learning for Classification (OLC). (2008) 1–8
13. Neapolitan, R.E.: Probabilistic Reasoning in Expert Systems: Theory and Algorithms. John Wiley & Sons, Inc., New York, NY, USA (1990)
14. Beynon, M., Cosker, D., Marshall, D.: An expert system for multi-criteria decision making using Dempster-Shafer theory. Expert Systems with Applications 20 (2001)
15. Tsang, S., Kao, B., Yip, K., Ho, W., Lee, S.: Decision trees for uncertain data. In: International Conference on Data Engineering (ICDE). (2009)
16. Elouedi, Z., Mellouli, K., Smets, P.: Belief decision trees: theoretical foundations. International Journal of Approximate Reasoning 28 (2001) 91–124
17. Jeon, J., Manmatha, R.: Using maximum entropy for automatic image annotation. In: International Conference on Image and Video Retrieval (CIVR). (2004)
18. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1993)
19. Sivic, J., Russell, B., Zisserman, A., Freeman, W., Efros, A.: Unsupervised discovery of visual object class hierarchies. In: CVPR. (2008) 1–8
20. Griffin, G., Perona, P.: Learning and using taxonomies for fast visual categorization. In: CVPR. (2008) 1–8
21. Torralba, A., Murphy, K., Freeman, W.: Sharing features: efficient boosting procedures for multiclass object detection. In: CVPR. Volume 2. (2004)
22. Dietterich, T., Bakiri, G.: Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research 2 (1995) 263–286
23. Lampert, C., Nickisch, H., Harmeling, S.: Learning to detect unseen object classes by between-class attribute transfer. In: CVPR. (2009)
24. Farhadi, A., Endres, I., Hoiem, D., Forsyth, D.: Describing objects by their attributes. In: CVPR. (2009)
25. Kumar, N., Berg, A.C., Belhumeur, P.N., Nayar, S.K.: Attribute and simile classifiers for face verification. In: ICCV. (2009)
26. Von Ahn, L., Maurer, B., McMillen, C., Abraham, D., Blum, M.: reCAPTCHA: Human-based character recognition via web security measures. Science 321 (2008)
27. Waite, M.: whatbird.com (http://www.whatbird.com/)
28. Vedaldi, A., Gulshan, V., Varma, M., Zisserman, A.: Multiple kernels for object detection. In: ICCV. (2009)
29. Platt, J.: Probabilities for SV machines. In: NIPS. (1999) 61–74
30. The Cornell Ornithology website: www.allaboutbirds.org (http://www.allaboutbirds.org/NetCommunity/page.aspx?pid=1053)