9
When Was That Made? Sirion Vittayakorn Alexander C. Berg Tamara L. Berg University of North Carolina at Chapel Hill sirionv, aberg, [email protected] Abstract In this paper, we explore deep learning methods for esti- mating when objects were made. Automatic methods for this task could potentially be useful for historians, collectors, or any individual interested in estimating when their artifact was created. Direct applications include large-scale data organization or retrieval. Toward this goal, we utilize fea- tures from existing deep networks and also fine-tune new networks for temporal estimation. In addition, we create two new datasets of 67,771 dated clothing items from Flickr and museum collections. Our method outperforms both a color-based baseline and previous state of the art methods for temporal estimation. We also provide several analyses of what our networks have learned, and demonstrate appli- cations to identifying temporal inspiration in fashion col- lections. 1. Introduction Immense progress has been made on methods to cat- egorize image and video content. From recognizing ob- jects [20, 12], to scenes [38], to activities [14], automatic techniques are beginning to successfully answer the ques- tion of what is shown in an image or video quite well. Re- cently, some computational methods have begun to look at other aspects of the recognition problem, including recog- nition of where [16], or when an image was captured [24]. In this paper, we explore the related problem of estimat- ing when an object was made from photographs of the ob- ject. Effective automatic methods for this task could poten- tially be useful for historical dating at large scale by histori- ans and collectors, or in more everyday scenarios to identify the vintage of objects. For example, online resellers like eBay and Etsy sell hundreds of thousands of vintage furni- ture, clothing, and jewelry items. An automatic method for dating man-made objects could be quite useful for organiz- ing and retrieving from these large disorganized collections. The temporal periods of objects tend to be indicated by particular stylistic elements, e.g., the shapes of headlights on cars, or the wide bell-bottoms of pants from the 60s. As such, previous methods for temporal estimation have taken a hand-crafted data mining approach to the problem, sifting through patches to discover mid-level visual patterns that correlate with time periods [21]. We take an alternative, purely discriminative, approach to the problem using deep- learning based methods and achieve greater accuracy than previous methods. Our first method trains discriminative models on top of existing CNN features and yields the aver- age mean absolute error (MAE) of 7.55 ± 0.51 years com- pared to the previous state of the art of 8.56 years in [21] on their dataset of car images. Furthermore, by adapt- ing and fine-tuning pre-tained networks to directly estimate the time, we improve performance to 3.97 years. We also demonstrate the same procedure to train a deep network for temporal estimation of clothing on two new datasets col- lected for this paper, consisting of 67K images of dated clothing items from the time period of 1900 through 2009. To our knowledge this is the first time deep networks have been used to estimate the time period of objects. As time has a natural ordering, there are additional opportu- nities for introspection into the structure learned by the deep network that are not available for arbitrary classifi- cation tasks. To understand what the deep networks are learning, we perform several analyses. First, we demon- strate that fine-tuning the pre-trained network for temporal estimation of objects causes some nodes to become signif- icantly more selective to narrow time periods (we quantify this over slices of the network using entropy histograms). We also perform analyses to compare the structures that the deep networks have learned to the previous mid-level dis- covery approach [21]. Finally, we demonstrate the appli- cability of our models to identifying temporal influence in fashion collections. In summary, our contributions are: 1) Deep learning ap- proaches to estimate when an object was made, 2) Two new datasets of 67,771 dated photographs of clothing items made between 1900 and 2009; one collected from Flickr, and the other collected from museums, 3) Analyses of what the fine-tuned temporal estimation networks have learned, and comparison to the mid-level patterns learned by the previous state of the art approach [21], and d) Application of temporal estimation to analyze the influence of vintage arXiv:1608.03914v1 [cs.CV] 12 Aug 2016

When Was That Made? - arXiv · 2016. 8. 16. · When Was That Made? Sirion Vittayakorn Alexander C. Berg Tamara L. Berg University of North Carolina at Chapel Hill sirionv, aberg,

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: When Was That Made? - arXiv · 2016. 8. 16. · When Was That Made? Sirion Vittayakorn Alexander C. Berg Tamara L. Berg University of North Carolina at Chapel Hill sirionv, aberg,

When Was That Made?

Sirion Vittayakorn Alexander C. Berg Tamara L. BergUniversity of North Carolina at Chapel Hill

sirionv, aberg, [email protected]

Abstract

In this paper, we explore deep learning methods for esti-mating when objects were made. Automatic methods for thistask could potentially be useful for historians, collectors, orany individual interested in estimating when their artifactwas created. Direct applications include large-scale dataorganization or retrieval. Toward this goal, we utilize fea-tures from existing deep networks and also fine-tune newnetworks for temporal estimation. In addition, we createtwo new datasets of 67,771 dated clothing items from Flickrand museum collections. Our method outperforms both acolor-based baseline and previous state of the art methodsfor temporal estimation. We also provide several analysesof what our networks have learned, and demonstrate appli-cations to identifying temporal inspiration in fashion col-lections.

1. IntroductionImmense progress has been made on methods to cat-

egorize image and video content. From recognizing ob-jects [20, 12], to scenes [38], to activities [14], automatictechniques are beginning to successfully answer the ques-tion of what is shown in an image or video quite well. Re-cently, some computational methods have begun to look atother aspects of the recognition problem, including recog-nition of where [16], or when an image was captured [24].

In this paper, we explore the related problem of estimat-ing when an object was made from photographs of the ob-ject. Effective automatic methods for this task could poten-tially be useful for historical dating at large scale by histori-ans and collectors, or in more everyday scenarios to identifythe vintage of objects. For example, online resellers likeeBay and Etsy sell hundreds of thousands of vintage furni-ture, clothing, and jewelry items. An automatic method fordating man-made objects could be quite useful for organiz-ing and retrieving from these large disorganized collections.

The temporal periods of objects tend to be indicated byparticular stylistic elements, e.g., the shapes of headlightson cars, or the wide bell-bottoms of pants from the 60s. Assuch, previous methods for temporal estimation have taken

a hand-crafted data mining approach to the problem, siftingthrough patches to discover mid-level visual patterns thatcorrelate with time periods [21]. We take an alternative,purely discriminative, approach to the problem using deep-learning based methods and achieve greater accuracy thanprevious methods. Our first method trains discriminativemodels on top of existing CNN features and yields the aver-age mean absolute error (MAE) of 7.55 ± 0.51 years com-pared to the previous state of the art of 8.56 years in [21]on their dataset of car images. Furthermore, by adapt-ing and fine-tuning pre-tained networks to directly estimatethe time, we improve performance to 3.97 years. We alsodemonstrate the same procedure to train a deep network fortemporal estimation of clothing on two new datasets col-lected for this paper, consisting of 67K images of datedclothing items from the time period of 1900 through 2009.

To our knowledge this is the first time deep networkshave been used to estimate the time period of objects. Astime has a natural ordering, there are additional opportu-nities for introspection into the structure learned by thedeep network that are not available for arbitrary classifi-cation tasks. To understand what the deep networks arelearning, we perform several analyses. First, we demon-strate that fine-tuning the pre-trained network for temporalestimation of objects causes some nodes to become signif-icantly more selective to narrow time periods (we quantifythis over slices of the network using entropy histograms).We also perform analyses to compare the structures that thedeep networks have learned to the previous mid-level dis-covery approach [21]. Finally, we demonstrate the appli-cability of our models to identifying temporal influence infashion collections.

In summary, our contributions are: 1) Deep learning ap-proaches to estimate when an object was made, 2) Twonew datasets of 67,771 dated photographs of clothing itemsmade between 1900 and 2009; one collected from Flickr,and the other collected from museums, 3) Analyses of whatthe fine-tuned temporal estimation networks have learned,and comparison to the mid-level patterns learned by theprevious state of the art approach [21], and d) Applicationof temporal estimation to analyze the influence of vintage

arX

iv:1

608.

0391

4v1

[cs

.CV

] 1

2 A

ug 2

016

Page 2: When Was That Made? - arXiv · 2016. 8. 16. · When Was That Made? Sirion Vittayakorn Alexander C. Berg Tamara L. Berg University of North Carolina at Chapel Hill sirionv, aberg,

styles on fashion.The rest of our paper is organized as follows. First, we

review related works (Sec 2). Then we describe three datedobject datasets, one existing, and two novel datasets (Sec 3).Next, we describe our deep learning based approaches to es-timate when objects were made (Sec 4) and evaluate thesemodels on the temporal estimation task (Sec 5). Then, wepresent several analyses of what our networks have learnedand how our approach compares to a data mining approachto temporal prediction (Sec 6). Finally, we apply our tempo-ral estimation approach to explore the influence of vintagefashion on fashion show images (Sec 7).

2. Related workWe review work in 3 areas related to our research: deep

learning for visual recognition, visual analyses of deep net-works, and visual data mining.Deep Learning: Convolutional neural networks (CNNs)have been one of the driving forces toward improving per-formance of visual recognition algorithms. These deeplearning approaches, originally introduced as perceptronsin the 1950s [28], experienced a resurgence in popular-ity during the 1990s, and then in recent years have almosttaken over the recognition community after demonstrationsof remarkable image classification [20, 31, 17] and detec-tion [33] performance on benchmark tasks. We make useof both networks, from [20, 31], trained for classification ofthe 1.2 million labeled images in ImageNet [5]. These net-works demonstrate better performance than the best hand-crafted features on the ImageNet Large Scale Visual Recog-nition Challenge (ILSVRC) [29].

The feature representations learned by these networkson ImageNet data have been shown to generalize well toother image classification tasks [7] as well as related taskssuch as object detection [13, 30], pose estimation and ac-tion detection [14], or fine-grained category detection [36].Moreover, in a somewhat related task to ours, S. Karayevet al. [19] show that using a pre-trained network [7] asa generic feature extractor, produces a better classifier forphoto and painting style than hand-crafted features. We arenot aware of any prior work on modeling historical visualstyle using deep-learning based methods.Analysis of CNNs: Unlike hand-crafted features such asSIFT [23] or HOG [4], the representation learned by CNNsis not obviously interpretable. For many tasks the CNN isused as a black-box algorithm where it is not always clearwhy the CNN is outperforming previous approaches. Sev-eral recent works have attempted to peer into this box, tobetter understand the representations learned by CNNs. P.Fischer et al. [11] compares the learned representation withSIFT in a descriptor matching task. M.D. Zeiler et al. [35]propose several heuristic visualization techniques for unitsin the network. J. Long et al. [22] study the effectiveness

of convnet activation features for tasks requiring correspon-dence. Recently, B Zhou et al. [37] present a techniqueto visualize learned representations of each unit in the net-work. Here they focus on a large dataset of scene imagesand show that object detection is embedded in the networkas a result of learning. We use variants of this approach toevaluate what our temporal networks have learned.Visual Data Mining: Several visual data mining ap-proaches have been used for tasks such as unsuperviseddiscovery of object categories in image collections [15, 25,9, 32], or for finding discriminative parts of actions [26],cities [6], or objects [21]. There has also been related workon discovering localized attributes [2, 8, 27] for improvedfine-grained recognition. Most of these approaches startfrom existing hand-crafted feature representations and pre-defined measures of visual similarity, then based on dis-covered patterns of features in the data, group visual el-ements into discovered entities. The most relevant workis from [21] which go beyond simply detecting recurringvisual elements to model the stylistic differences betweenobjects over time or space. Unlike this prior work, we donot use hand-crafted visual representations. Instead, we userepresentations learned by CNN models and achieve sig-nificantly improved performance. Additionally, we analyzewhat the network has learned, temporally, and in compari-son to the existing data mining approach [21].

3. DatasetsWe use two main data sources in our work: a) the Car

Database (CarDb) collected by [21], and b) a large new col-lection of clothing photographs with associated dates, col-lected from Flickr. Since the date information in the Flickrclothing dataset is provided by the user, the dates associatedwith objects can sometimes be noisy. Therefore, we alsocollect a smaller dataset of clothing related photographsfrom 2 different museum collections. Because these pho-tographs have been curated by experts, their date labels arequite reliable. We use this dataset as an alternative test setto evaluate clothing date models trained on the larger Flickrclothing dataset. As this Museum dataset has a somewhatdifferent domain than the Flickr images, we can also use itevaluate model generalization, i.e. training on one datasetand testing on another.Car Database: The Car Database (CarDb) [21] contains13,474 photos of cars made from 1920 to 1999 resulting8 temporal classes, collected from www.cardatabase.net. The first row of Figure 1 shows example images fromCarDb, sorted by time.Flickr Clothing Dataset: We initially collect more than100,000 clothing related images together with their corre-sponding meta-data from a wide variety of 50 Flickr groupsfocused on vintage fashions, e.g. “Fashions Past - Best andWorst” and “As She Was”. Some of these images contain

Page 3: When Was That Made? - arXiv · 2016. 8. 16. · When Was That Made? Sirion Vittayakorn Alexander C. Berg Tamara L. Berg University of North Carolina at Chapel Hill sirionv, aberg,

Figure 1: Examples images from CarDb (first row), Flickr clothing dataset (second row) and Museum dataset (third row),sorted by time.

drawings of fashion items or depict other fashion relatedimages without clear examples of clothing items. To re-move these images we apply a face detection algorithm [39]to automatically filter out images without a depicted per-son. The remaining images are manually inspected to re-move additional non-photographic content such as artwork,painting, and advertisements. From the meta-data such astitle, description, and tags we automatically extract a poten-tial decade label such as 90s, 1965, 1954-1957, 1920s etc.Then, we quantized these labels into an 11-bin histogramwith dates ranging from 1900-2009. Finally, a datasetcontains 58,350 clothing photographs with correspondingmeta-data, including photo id, user id, title, description,tags, longitude, latitude, number of views and groups, etc.Some example images from our new Flickr clothing dataset,sorted by time, are shown in the second row of Figure 1.

Museum Dataset: As the temporal labels of the Flickrdataset have been associated with images by Flickr users,we also collect an additional dataset that has been labeledby expert museum curators from 2 different museums; theMetropolitan Museum of Art, and Europeana Fashion. Eu-ropeana Fashion is a museum network co-funded by 22partners from 12 European countries which represent lead-ing European institutions and some of the largest collectionsin the fashion domain. The Museum dataset has 9,421 im-ages between 1900 to 2009, showing clothing worn on peo-ple. We use this dataset as a test set for models trained onthe larger Flickr clothing dataset. Examples of clothing im-ages from the Museum dataset, sorted by time, are shownin the third row of Figure 1.

4. Approaches

We pursue two approaches for temporal estimation.First, we evaluate classifiers trained on features from a pre-trained network. Then, we explore adapting the network todirectly predict time period, fine-tuning for this task.

Pre-trained models: We start with two Convolutional Neu-

ral Network (CNN) models pre-trained on 1.2 million la-beled images from ImageNet. The first network, AlexNet,was originally described by [20] and the latter, VGG, wasdescribed by [31], both networks are implemented as partof the Caffe framework [18]. For each network, we extractthe learned representation from the second fully-connectedlayer and use this 4096 dimensional vector as our visual rep-resentation. We experiment with two different classificationmethods: a linear Support Vector Machines (SVM) [10]with fixed Csvm = 0.1, and Support Vector Regressors(SVR) [3] with fixed ε = 0.1 and set Csvr = 100.

Fine-tuned models: Network fine-tuning has proven suc-cessful for adapting networks to new tasks beyond theiroriginal purpose such as object detection [12], pose esti-mation and action detection [14], or fine-grained categorydetection [36]. In this work, we are interested in fine-tuningthe original object classification model for the temporal es-timation task.

From each pre-trained model, described by [20, 31], wefine-tune 3 different models. The first model, the fine-tuned Car model, is fine-tuned on 10,130 training imagesand tested on 3,343 images from the CarDb dataset (sametrain/test split as [21]). The second model, the fine-tunedClothing model, uses 3/4 of the images from the Flickrclothing dataset as training set. Moreover, since our datasetsdepict vintage photographs from 1900-2009, historic colorcues, e.g black and white color or sepia tones in early pho-tos and color photographs in later photos, appear in bothdatasets. To evaluate performance without any color cues,we also fine-tune a third network using only black/whiteimages from the Flickr clothing dataset. Finally, both mod-els are evaluated during testing on the portion of the Flickrclothing dataset and on the Museum dataset. To fine-tuneeach model, we change the output of the last fully con-nected layer from 1000 classes to 11 temporal classes (oneper decade) for the Flickr clothing dataset and 8 temporalclasses for CarDb. We trained our models using stochastic

Page 4: When Was That Made? - arXiv · 2016. 8. 16. · When Was That Made? Sirion Vittayakorn Alexander C. Berg Tamara L. Berg University of North Carolina at Chapel Hill sirionv, aberg,

CarDb Clothing MuseumLee et al. [21] 8.56 17.74 19.56Palermo et al. [24] - 17.21 21.21alexNet + SVM 7.77 12.99 17.33alexNet + SVR 8.10 16.73 22.00VGG-16 + SVM 6.90 12.60 16.35VGG-16 + SVR 7.43 15.87 18.76alexNet (BW/FT) 6.78 17.16 17.96alexNet (FT) 6.17 12.88 16.43VGG-16 (BW/FT) 4.27 13.66 16.40VGG-16 (FT) 3.97 11.54 14.23

Table 1: The mean absolute error (years) training and test-ing on CarDb dataset and training on Flickr clothing datasetand testing on held out clothing and Museum dataset.

gradient descent with a batch size of 50 examples, momen-tum of 0.9, weight decay of 0.0005 and decrease the learn-ing rate of the models to 0.00001. Finally, we trained thenetworks for 50,000 cycles.

5. Temporal Estimation ExperimentsWe evaluate performance of our models compared to Lee

et al. [21] and F. Palermo et al. [24].Pre-trained Model Performance: For the pre-trainedmodels, we extract features from last fully-connected layer,then train SVM and SVR models to estimate when an ob-ject was made. We evaluate performance on both the carand clothing datasets. Table 1 shows mean absolute error(MAE) in years across features and classifiers, includingcomparisons to previous works [21] on all datasets. Thoughwe are relying on pre-trained models in this approach, wealready outperform the previous state of the art on bothdatasets. Moreover, for clothing, the results show that eventhough the domain shift is quite evident (the results on theMuseum dataset are worse than results on the held out por-tion of the Flickr clothing dataset), the deep learning featureis beneficial for the temporal estimation task. Evaluationsachieve error reductions of 0.95±2.47 years and 3.19±2.06years on the Museum and Flickr clothing collections respec-tively compared to the baseline [21].Fine-tuned Model Performance: The fine-tuned mod-els consistently outperform the pre-trained models on bothobject categories. For cars, the MAE decreases around2.48±0.86 years. Similar trends apply for clothing, the fine-tuned model decrease MAE around 2.86±2.42 years on theMuseum dataset and 1.67± 1.94 years on the nosier Flickrclothing dataset. Moreover, the results confirm that the fine-tuned network learns visual elements beyond color by out-performing two color-based baselines. The first baseline, byF. Palermo et al. [24] which proposed temporally discrim-inative features related to the evolution of color imagingprocesses over time, achieves worse MAE on both datasets

0

100

200

300

400

500

600

700

800

900

1000

-4.0 -4.0 -3.9 -3.9 -3.8 -3.8 -3.8 -3.7 -3.7 -3.6 -3.6 -3.6 -3.5 -3.5 -3.4 -3.4 -3.3 -3.3 -3.3 -3.2 -3.2 -3.1 -3.1 -3.1 -3.0

Nu

mb

er

of

un

its

Entropy (x103) Pre-trained Fine-tuned

Histogram of unit temporal entropy from the first FC layer

(a) fully connected layer 6.

0

100

200

300

400

500

600

700

800

900

1000

-4.3 -4.3 -4.2 -4.2 -4.1 -4.1 -4.0 -4.0 -3.9 -3.9 -3.8 -3.7 -3.7 -3.6 -3.6 -3.5 -3.5 -3.4 -3.4 -3.3 -3.3 -3.2 -3.1 -3.1 -3.0

Nu

mb

er

of

un

its

Entropy (x103) Pre-trained Fine-tuned

Histogram of unit temporal entropy from the second FC layer

(b) fully connected layer 7.

Figure 2: Histogram of unit temporal entropy from 2 differ-ent fully connected layers from alexnet(blue) and fine-tunednetwork(orange).

compare to our fine-tuned networks. Finally, we also com-pare our approach with a second baseline, where we fine-tune the network using only black/white images from Flickrclothing dataset. The results show that the B/W modelachieves about 2.53 ± 1.2 years higher MAE in this task.These results emphasize that even though color is an impor-tant clue, our fine-tuned network is able to learn temporallysensitive features of an object beyond color.

6. Deep Network AnalysesBased on these quantitative results, we find that deep

learning methods are promising for temporal estimationtask. However, unlike the patch discovery methods [21], thelearned representations from deep networks are not imme-diately interpretable. Thus, in this section, we provide someanalyses of the CNN networks to gain additional under-standing about what the fine-tuned networks have learned.

Page 5: When Was That Made? - arXiv · 2016. 8. 16. · When Was That Made? Sirion Vittayakorn Alexander C. Berg Tamara L. Berg University of North Carolina at Chapel Hill sirionv, aberg,

(a) 1920s (b) 1930s (c) 1940s (d) 1950s

(e) 1960s (f) 1970s (g) 1980s (h) 1990s

Figure 3: Top 6 images with maximum activation from the low entropy units in each decade. The green rectangle indicatesan image is from the same decade as the decade with the lowest entropy for a given unit, while the red indicates otherwise.

(a) 1900s (b) 1910s (c) 1930s

(d) 1950s (e) 1970s (f) 1990s

Figure 4: Top 4 images with maximum activation from the low entropy units in each decade. The green rectangle indicatesan image is from the same decade as the decade with the lowest entropy for a given unit, while the red indicates otherwise.

6.1. Temporally-sensitive units

Since the fine-tuned network outperforms the pre-trainednetwork, we hypothesize that there may be interesting dif-ferences in the network before and after fine-tuning. Onepotential difference that we investigate is the temporal sen-sitivity of units. To explore the temporal sensitivity ofunits in both networks, for each unit, images are rankedby their maximum activation. Then we bin the top N =500 maximum activation images into a temporal histogram,by decade, and compute the entropy of each histogram:E(u) = −

∑ni=1H(i) · log2H(i) where H(i) denotes the

histogram count for bin i and n denotes the number of quan-tized label bins. Lower entropy values indicate higher tem-poral sensitivity. Finally, we compute the entropy histogramof all units from 2 Fully-Connected layers as shown in Fig-ure 2a (first FC layer) and Figure 2b (second FC layer).Both histograms show that the fine-tuned network has morelow entropy units than the pre-trained network which indi-cate that units have been fine-tuned to capture a temporallydiscriminative feature for a specific time period. These re-sults are visible in both layers, but are more pronounced in

second layer, which makes intuitive sense since this layeris closest to the end temporal estimation. Qualitative ex-amples of top images ranked by their maximum activationfrom the low entropy units are shown in Figure 3 for theCarDb and Figure 4 for clothing dataset.

6.2. Unit activation analysis

Beyond more highly-tuned temporal sensitivity in units,we would like to understand whether the network haslearned to detect temporally discriminative object appear-ances or not. Thus, we investigate the unit activation pat-terns to better understand whether the temporally sensitiveregions correspond to semantic elements of objects. To doso, we follow the data-driven approach proposed by [37]to estimate the learned receptive fields (RFs) of units. Toestimate a unit’s RF, images are ranked by their maximumactivations for that unit, and the top K images are selectedto identify image regions that led to these high activations.To recover the high activation regions within an image, eachimage is replicated many times with a small occluder of size11x11 placed at one of about 5,500 locations in a dense grid(stride 3 pixels) in the images. Each occluded image is eval-

Page 6: When Was That Made? - arXiv · 2016. 8. 16. · When Was That Made? Sirion Vittayakorn Alexander C. Berg Tamara L. Berg University of North Carolina at Chapel Hill sirionv, aberg,

(a) 1920s (b) 1930s (c) 1940s (d) 1950s

(e) 1960s (f) 1970s (g) 1980s (h) 1990s

Figure 5: In each rectangle, 3 image regions, from 1920s to 1990s, with the maximum activations from 3 different units fromfc7 of the fine-tuned network on CarDb are shown.

(a) 1900s (b) 1910s (c) 1920s (d) 1930s

(e) 1940s (f) 1950s (g) 1960s (h) 1970s

Figure 6: Each rectangle shows 3 images with the maximum activation regions, in green highlight, of units which have thelowest entropy in each decade of the fine-tuned network on Flickr Clothing dataset.

uated by the same network and the change in activation ver-sus the original is calculated. Those differences are com-bined into a discrepancy map over the image. The intuitionbehind this approach is that if there is a large discrepancybetween activations before and after occlusion, then the oc-cluded region is important for activating that unit.

In order to investigate the most informative regions in animage, we focus on image regions which highly contributeto the prediction decision. Therefore, we first select all truepositive images from the fine-tuned network. For a givenimage, we rank the units based on their contribution to theprediction decision and assign the image to the top N units.For each unit, we compute the discrepancy map of assignedimages, following [37]. Figure 5- 6 show image regions that

caused the maximum activation for the given unit from thelast FC layer from CarDb and Flickr clothing dataset. Re-sults indicate that units in the fine-tuned network respond totemporally informative parts such as front bumpers, head-lights, or wheels for cars and cinched-in waists (40s-50s),mod dresses (60s) and leisure suits (70s) for clothing.

6.3. Discriminative part correlation

So far we have observed temporal sensitivity adaptationand indications that nodes in the network have tuned theiractivations to particular object parts. These results lead usto new interesting questions. Are the parts discovered bythe network discriminative in time? Do these visual ele-ments correspond to the style-sensitive elements discoveredby [21]?

Page 7: When Was That Made? - arXiv · 2016. 8. 16. · When Was That Made? Sirion Vittayakorn Alexander C. Berg Tamara L. Berg University of North Carolina at Chapel Hill sirionv, aberg,

(a) high correlation (b) poor correlation

Figure 7: For each block, the top two rows show style-sensitive patches from [21], while the bottom two rows show regionswith the maximum activation from the same images. While (a) shows unit activations which are highly correlated to style-sensitive patches, (b) shows unit activations which are poorly correlated to style-sensitive patches

0

2

4

6

8

10

12

14

16

18

20

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

# d

iscri

min

ative

-ma

x a

ctiva

tio

n p

atc

he

s

Average IOU score

conv fc

Histogram of average IOU score between discriminative patches

and maximum unit activation patches.

Figure 8: The average IoU score between style-sensitivepatches and maximum activation patches.

To identify correspondences between the visual elementslearned by our fine-tuned network and the style-sensitive el-ements proposed by [21], we search for units which have asimilar behavior as their style-sensitive detectors. We lookfor two behavioral factors: (1) high responses on similarsets of images, and (2) similar localization patterns. To im-plement this, we generate two image rankings; the first oneis based on the maximum activation for a given unit u overimages, and the other is based on the maximum detectorconfidence for a given detector d over images. Then, wedefined the correlation C between unit u and generic detec-tor d as C(u, d) = |An∩Dn|

n where An are the set of top nimages from the activation based ranking andDn are the setof top n images from the detection based ranking (n = 30%in all experiments). For a given detector, we rank all unitsin each layer by the correlation score.

We evaluate this correlation on units in the last convo-lutional layer (conv) and a second fully connected layer(fc). The average correlation scores between style-sensitivedetectors and their top 5 correlated units from conv andfc are 0.521 and 0.543 respectively, indicating that about

half of the units in both layers overlap with style-sensitivedetectors. Finally we compute the average Intersection-over-Union score (IoU) between our maximum activationpatches and the style-sensitive patches. More specifically,for each correlated unit, we rank images by the maximumactivation. Then, we sample 80 × 80 pixel patches withmaximum activation and compute the IoU score betweenthis patch and the style-sensitive patch from the correlateddetector. In this experiment, we sample a maximum acti-vation patch from the top 20 images/unit with 5 correlatedunits/style-sensitive detector. The average IoU scores of20 × 5 style-sensitive and maximum activation patches areshown in Figure 8.

These results, again, emphasize that units in the networkare fine-tuned to temporally sensitive parts of an object. Ad-ditionally, we find that only 7.5% of the patches have av-erage IoU score < 0.1 while 61.25% of the patches havethe average IoU score ≥ 0.5, confirming that the style-sensitive parts from [21] are automatically discovered byour network. Qualitative examples of high/poor correlationpatches are shown in Figure 7. While Figure 7a shows themaximum activation patches from units which are highlycorrelated with style-sensitive parts proposed by [21], Fig-ure 7b shows visual elements which have low correlationto [21]. We posit that these additional patches are alsotemporally sensitive, and contribute to our improved per-formance.

7. ApplicationsDenim miniskirts, ripped distressed jeans, denim jackets,

tracksuits, trench coats, puffy jackets, preppy polo shirtswith popped collars, neon colors, gladiator shoes, the listcan go on and on. People look back to the past and say“That’s ridiculous! What were they wearing?”. Yet some-how the fashion world recycles these trends, makes a fewtweaks, and voila! they keep coming back. It’s nearly im-possible to think of something that has absolutely neverbeen done before. Designers have to look for inspirationfrom somewhere, and what better place to be inspired bythan the past?

Page 8: When Was That Made? - arXiv · 2016. 8. 16. · When Was That Made? Sirion Vittayakorn Alexander C. Berg Tamara L. Berg University of North Carolina at Chapel Hill sirionv, aberg,

(a) 1940s (b) 1960s (c) 1970s

(d) 1970s (e) 1980s (f) 1980s

Figure 9: Predicting vintage influence in fashion collections. (a)-(f) indicate the decade of predicted influence.

0

0.05

0.1

0.15

0.2

0.25

00s 10s 20s 30s 40s 50s 60s 70s 80s 90s 00s

#co

llectio

ns

decade

The influence of vintage fashion on fashion show collections in 2000

(a) 2000

0

0.05

0.1

0.15

0.2

0.25

00s 10s 20s 30s 40s 50s 60s 70s 80s 90s 00s

#co

llectio

ns

decade

The influence of vintage fashion on fashion show collections in 2004

(b) 2004

0

0.05

0.1

0.15

0.2

0.25

2000 2002 2004 2006 2008 2010 2012 2014

#co

llectio

ns

decade

The influence of vintage fashion in 2000s

1960

1980

1990

(c) 2000-14

Figure 10: The influence of vintage fashion (1900s-2000s) in fashion show collections from (a) 2000 and (b) 2004. Figure(c) shows the influence of 1960s, 1980s and 1990s fashion trhough out 2000s - 2010s.

Therefore, we demonstrate the use of our learned mod-els for predicting the influence of fashion from past decadeson the fashion collections. In particular, we evaluate ourmodels on a Runway dataset [34] containing 300k imagesfrom fashion shows over the years of 2000 to 2015. To esti-mate the influence of the vintage fashion on fashion col-lections, we first apply our fine-tuned model to estimatewhen the outfits were made. Then, we define the inspira-tion date of the collection as the decade with the highestprobability among images from that collection. To evalu-ate our approach, we collect human judgments on the sametask using Amazon Mechanical Turk. For each assign-ment, 5 works are shown 5 fashion show images per collec-tion and ask to identify the decade that inspired these im-ages. We randomly pick up to 200 collections per predicteddecade and remove collections with low human agreement(less than 3 of 5 agree). This leaves us with 300 collec-tions for evaluation. On these collections, our model has58.33% agreement with MAE of 8.6 years compared to hu-man judgments. Some examples of our temporal predictionare shown in Figure 9.

Finally, we also observe temporal influence trends onfashion show collections over time. To do so, we explorethe classification confidence of collections from the sameyear as shown in Figure 10a - 10b. If we look at the clas-sification confidence of a particular vintage decade acrossyears, we spot some interesting trends. For example, fromFigure 10c, we can see that 1990s fashion had a strong in-fluence during both early 2000s and early 2010s while dur-ing the mid-2000s a revival of 1960s and 1980s fashion oc-curred [1].

8. ConclusionsIn this work, we first explore CNN approaches to auto-

matically estimate when objects were made, evaluated onan existing dataset of cars and two new datasets of vintageclothing photographs. Then, we provide several analysesof what the networks have learned, including exploring thetemporal sensitivity of nodes, as well as examining node ac-tivations and comparison to discriminative parts learned bythe data mining approach [21]. Finally, we propose an ap-plication for temporal estimation task of clothing in the realworld scenario.

Page 9: When Was That Made? - arXiv · 2016. 8. 16. · When Was That Made? Sirion Vittayakorn Alexander C. Berg Tamara L. Berg University of North Carolina at Chapel Hill sirionv, aberg,

References[1] 2000s in fashion. https://en.wikipedia.org/

wiki/2000s_in_fashion. Accessed: 2016-05-13.[2] T. L. Berg, A. C. Berg, and J. Shih. Automatic attribute dis-

covery and characterization from noisy web data. In ECCV,2010.

[3] C.-C. Chang and C.-J. Lin. LIBSVM: A library for supportvector machines. ACM Transactions on Intelligent Systemsand Technology, 2:27:1–27:27, 2011.

[4] N. Dalal and B. Triggs. Histograms of oriented gradients forhuman detection. In CVPR, volume 1, pages 886–893 vol. 1,June 2005.

[5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei.ImageNet: A Large-Scale Hierarchical Image Database. InCVPR, 2009.

[6] C. Doersch, S. Singh, A. Gupta, J. Sivic, and A. Efros. Whatmakes paris look like paris? ACM ToG, 31(4), 2012.

[7] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang,E. Tzeng, and T. Darrell. Decaf: A deep convolutional acti-vation feature for generic visual recognition. arXiv preprintarXiv:1310.1531, 2013.

[8] K. Duan, D. Parikh, D. Crandall, and K. Grauman. Dis-covering localized attributes for fine-grained recognition. InCVPR, pages 3474–3481. IEEE, 2012.

[9] A. Faktor and M. Irani. clustering by compositionunsuper-vised discovery of image categories. tPAMI, 36(6):1092–1106, 2014.

[10] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J.Lin. LIBLINEAR: A library for large linear classification.JMLR, 9:1871–1874, 2008.

[11] P. Fischer, A. Dosovitskiy, and T. Brox. Descriptor match-ing with convolutional neural networks: a comparison to sift.arXiv preprint arXiv:1405.5769, 2014.

[12] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich fea-ture hierarchies for accurate object detection and semanticsegmentation. In CVPR, 2014.

[13] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich fea-ture hierarchies for accurate object detection and semanticsegmentation. In CVPR, pages 580–587. IEEE, 2014.

[14] G. Gkioxari, B. Hariharan, R. Girshick, and J. Malik. R-cnns for pose estimation and action detection. arXiv preprintarXiv:1406.5212, 2014.

[15] K. Grauman and T. Darrell. Unsupervised learning of cat-egories from sets of partially matching image features. InCVPR, 2006.

[16] J. Hays and A. A. Efros. Im2gps: estimating geographicinformation from a single image. In CVPR, 2008.

[17] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learn-ing for image recognition. arXiv preprint arXiv:1512.03385,2015.

[18] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Gir-shick, S. Guadarrama, and T. Darrell. Caffe: Convolu-tional architecture for fast feature embedding. arXiv preprintarXiv:1408.5093, 2014.

[19] S. Karayev, M. Trentacoste, H. Han, A. Agarwala, T. Darrell,A. Hertzmann, and H. Winnemoeller. Recognizing imagestyle. In BMVC, 2014.

[20] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenetclassification with deep convolutional neural networks. InNIPS. 2012.

[21] Y. J. Lee, A. A. Efros, and M. Hebert. Style-aware mid-levelrepresentation for discovering visual connections in spaceand time. In ICCV, 2013.

[22] J. L. Long, N. Zhang, and T. Darrell. Do convnets learncorrespondence? In NIPS, 2014.

[23] D. G. Lowe. Distinctive image features from scale-invariantkeypoints. IJCV, 2004.

[24] F. Palermo, J. Hays, and A. A. Efros. Dating historical colorimages. In ECCV, pages 499–512, 2012.

[25] N. Payet and S. Todorovic. From a set of shapes to objectdiscovery. In ECCV. 2010.

[26] M. Raptis, I. Kokkinos, and S. Soatto. Discovering discrim-inative action parts from mid-level video representations. InCVPR, 2012.

[27] M. Rastegari, A. Farhadi, and D. Forsyth. Attribute discov-ery via predictable discriminative binary codes. In ECCV.2012.

[28] F. Rosenblatt. The perceptron, a perceiving and recognizingautomaton. In Project Para, Cornell Aeronautical Labora-tory, 1957.

[29] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh,S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein,A. C. Berg, and L. Fei-Fei. ImageNet Large Scale VisualRecognition Challenge. IJCV, 2015.

[30] P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. LeCun.Pedestrian detection with unsupervised multi-stage featurelearning. In CVPR, 2013.

[31] K. Simonyan and A. Zisserman. Very deep convolutionalnetworks for large-scale image recognition. In InternationalConference on Learning Representations, 2015.

[32] J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T.Freeman. Discovering objects and their location in images.In ICCV, 2005.

[33] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed,D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich.Going deeper with convolutions. In CVPR, 2015.

[34] S. Vittayakorn, K. Yamaguchi, A. C. Berg, and T. L. Berg.Runway to realway: Visual analysis of fashion. In WACV,2015.

[35] M. D. Zeiler and R. Fergus. Visualizing and understandingconvolutional networks. In ECCV. 2014.

[36] N. Zhang, J. Donahue, R. Girshick, and T. Darrell. Part-based R-CNNs for fine-grained category detection. In Pro-ceedings of the European Conference on Computer Vision(ECCV), 2014.

[37] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba.Object detectors emerge in deep scene cnns. arXiv preprintarXiv:1412.6856, 2014.

[38] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva.Learning Deep Features for Scene Recognition using PlacesDatabase. NIPS, 2014.

[39] E. Zhou, H. Fan, Z. Cao, Y. Jiang, and Q. Yin. Extensive fa-cial landmark localization with coarse-to-fine convolutionalnetwork cascade. In ICCVW, 2013.