
Geography-Aware Self-supervised Learning

Kumar Ayush*, Dept. of Computer Science, Stanford University, [email protected]
Burak Uzkent*, Dept. of Computer Science, Stanford University, [email protected]
Chenlin Meng, Dept. of Computer Science, Stanford University, [email protected]
Kumar Tanmay, Dept. of Electrical Engg., IIT Kharagpur, [email protected]
Marshall Burke, Dept. of Earth Science, Stanford University, [email protected]
David Lobell, Dept. of Earth Science, Stanford University, [email protected]
Stefano Ermon, Dept. of Computer Science, Stanford University, [email protected]

*Equal Contribution

    Abstract

Contrastive learning methods have significantly narrowed the gap between supervised and unsupervised learning on computer vision tasks. In this paper, we explore their application to remote sensing, where unlabeled data is often abundant but labeled data is scarce. We first show that due to their different characteristics, a non-trivial gap persists between contrastive and supervised learning on standard benchmarks. To close the gap, we propose novel training methods that exploit the spatio-temporal structure of remote sensing data. We leverage spatially aligned images over time to construct temporal positive pairs in contrastive learning and geo-location to design pre-text tasks. Our experiments show that our proposed method closes the gap between contrastive and supervised learning on image classification, object detection and semantic segmentation for remote sensing and other geo-tagged image datasets.

1. Introduction

Inspired by the success of self-supervised learning methods [4, 15], we explore their application to large-scale remote sensing datasets (satellite images). It has been recently shown that self-supervised learning methods perform comparably well or even better than their supervised learning counterparts on image classification, object detection, and semantic segmentation on traditional computer vision datasets [24, 12, 15, 4, 3]. However, their application to remote sensing images is largely unexplored, despite the fact that collecting and labeling remote sensing images is particularly costly, as annotations often require domain expertise [41, 18, 6, 40, 35].

In this direction, we first experimentally evaluate the performance of an existing self-supervised contrastive learning method, MoCo-v2 [4], on remote sensing datasets, finding a performance gap with supervised learning using labels. For instance, on the Functional Map of the World (fMoW) image classification benchmark [6], we observe an 8% gap in top-1 accuracy between supervised and self-supervised methods.

To bridge this gap, we propose geography-aware contrastive learning to leverage the spatio-temporal structure of remote sensing data. In contrast to typical computer vision images, remote sensing data are often geo-located and might provide multiple images of the same location over time. Contrastive methods encourage closeness of representations of images that are likely to be semantically similar (positive pairs). Unlike contrastive learning for traditional computer vision images, where different views (augmentations) of the same image serve as a positive pair, we propose to use temporal positive pairs from spatially aligned images over time. Utilizing such information allows the representations to be invariant to subtle variations over time (e.g., due to seasonality). This can in turn result in more discriminative features for tasks focusing on spatial variation, such as object detection or semantic segmentation (but not necessarily for tasks involving temporal variation, such as change detection).


Figure 1: Left shows the original MoCo-v2 [4] framework. Right shows a schematic overview of our approach.

In addition, we design a novel unsupervised learning method that leverages geo-location information, i.e., knowledge about where the images were taken. More specifically, we consider the pre-text task of predicting where in the world an image comes from, similar to [13, 14]. This can complement the information-theoretic objectives typically used by self-supervised learning methods by encouraging representations that reflect geographical information, which is often useful in remote sensing tasks [34]. Finally, we integrate our proposed methods into a single geography-aware contrastive learning objective.

Our experiments on the Functional Map of the World [6] dataset, consisting of high spatial resolution satellite images, show that we improve the MoCo-v2 baseline significantly. In particular, we improve it by ∼8% classification accuracy when testing the learned representations on image classification, ∼2% AP on object detection, ∼1% mIoU on semantic segmentation, and ∼3% top-1 accuracy on land cover classification. Interestingly, our geography-aware learning can even outperform the supervised learning counterpart on temporal data classification by ∼2%. With the proposed method, we can improve the accuracy on target applications utilizing object detection and semantic segmentation [1, 2, 42]. To further demonstrate the effectiveness of our geography-aware learning approach, we extract the geo-location information of ImageNet images using the Flickr API, similar to [8], which provides us with a subset of 543,435 geo-tagged ImageNet images. We extend the proposed approaches to geo-located ImageNet, and show that geography-aware learning can improve the performance of MoCo-v2 by ∼2% on image classification, showing the potential application of our approach to any geo-tagged dataset. Figure 1 shows our contributions in detail.

2. Related Work

Self-supervised methods use unlabeled data to learn representations that are transferable to downstream tasks (e.g., image classification). Two common families of self-supervised methods are pre-text tasks and contrastive learning.

Pre-text Tasks. Pre-text-task-based learning [25, 44, 32, 51, 46, 31] can be used to learn feature representations when data labels are not available. [10] rotates an image and then trains a model to predict the rotation angle. [50] trains a network to perform colorization of a grayscale image. [29] represents an image as a grid, permutes the grid, and then predicts the permutation index. In this study, we use geo-location classification as a pre-text task, in which a deep network is trained to predict a coarse geo-location of where in the world the image might come from.

Contrastive Learning. Recent self-supervised contrastive learning approaches such as MoCo [15], MoCo-v2 [4], SimCLR [3], PIRL [25], and FixMatch [36] have demonstrated superior performance and have emerged as the front-runners on various downstream tasks. The intuition behind these methods is to learn representations by pulling positive image pairs from the same instance closer in latent space while pushing negative pairs from different instances further away. These methods differ, however, in the type of contrastive loss, the generation of positive and negative pairs, and the sampling method.

Although growing rapidly in the self-supervised learning area, contrastive learning methods have not been explored on large-scale remote sensing datasets. In this work, we provide a principled and effective approach for improving representation learning using MoCo-v2 [4] for remote sensing data as well as geo-located conventional datasets.

Unsupervised Learning in Remote Sensing Images. Unlike in traditional computer vision areas, unsupervised learning in the remote sensing domain has not been studied comprehensively. Most studies utilize small-scale datasets specific to a small geographical region [5, 19, 33, 17, 22], a few classes [39, 28], or a highly specific modality, i.e., hyperspectral images [26, 49]. Most of these studies focus on the UCM-21 dataset [48], consisting of fewer than 1,000 images from 21 classes.


  • "gsd": "img_width": "img_height": "country_code": "cloud_cover": “timestamp":

    2.10264849663 2421 2165 IND 6 2015-11-02 T05:44:14Z

    2.06074237823 2410 2156 IND 0 2016-03-09 T05:25:30Z

    1.9968634 2498 2235 IND 1 2017-02-02 T05:47:02Z

    2.2158575 2253 2015 IND 0 2017-02-27 T05:24:30Z

    1.24525177479 4016 3592 IND 0 2015-04-09 T05:36:04Z

    1.4581833 3400 3041 IND 2 2016-12-28 T05:57:06Z

    1.2518295 4003 3581 IND 0 2017-04-12 T05:51:49Z

    Figure 2: Images over time concept in the fMoW dataset. The metadata associated with each image is shown underneath.We can see changes in contrast, brightness, cloud cover etc. in the images. These changes render spatially aligned imagesover time useful for constructing additional positives.

Figure 3: Some examples from the GeoImageNet dataset. Below each image, we list its latitude, longitude, city, and country name. In our study, we use the latitude and longitude information for unsupervised learning. We recommend readers zoom in to see the details of the pictures.

Figure 4: Left: the histogram of the number of views. Right: the histogram of the standard deviation in years per area in the fMoW dataset.

A more recent study [40] proposes large-scale weakly supervised learning using a multi-modal dataset consisting of satellite images and paired geo-located Wikipedia articles. While effective, this method requires each satellite image to be paired with its corresponding article, limiting the number of images that can be used.

Geography-aware Computer Vision. Geo-location data has been studied extensively in prior works. Most of these studies utilize the geo-location of an image as a prior to improve image recognition accuracy [37, 16, 27, 23, 7]. Other studies [47, 13, 14, 45] use geo-tagged training datasets to learn how to predict the geo-location of previously unseen images at test time. In our study, we leverage geo-tag information to improve unsupervised and self-supervised learning methods.

    3. Problem Definition

We consider a geo-tagged visual dataset $\{((x_i^1, \cdots, x_i^{T_i}), lat_i, lon_i)\}_{i=1}^{N}$, where the $i$-th data point consists of a sequence of images $X_i = (x_i^1, \cdots, x_i^{T_i})$ at a shared location, with latitude and longitude equal to $lat_i$ and $lon_i$ respectively, over time $t_i = 1, \ldots, T_i$. When $T_i > 1$, we say the dataset has temporal information or structure. Although temporal information is often not available in natural image datasets (e.g., ImageNet), it is common in remote sensing. While the temporal structure is similar to that of conventional videos, there are some key differences that we exploit in this work. First, we consider relatively short temporal sequences, where the time difference between two consecutive "frames" could range from months to years. Additionally, unlike conventional videos, we consider datasets where there is no viewpoint change across the image sequence.
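To make this setup concrete, the following is a minimal sketch of how one such geo-tagged data point could be represented; the class and field names are ours, not from any released code.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class GeoTaggedSequence:
    """One data point: T_i spatially aligned images plus a geo-tag."""
    images: List[np.ndarray]  # x_i^1, ..., x_i^{T_i}, each an H x W x 3 array
    lat: float                # latitude of the shared location
    lon: float                # longitude of the shared location

    @property
    def has_temporal_structure(self) -> bool:
        # T_i > 1 means the same location is observed at multiple times.
        return len(self.images) > 1
```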

Given our setup, we want to obtain visual representations $z_i^t$ of images $x_i^t$ such that the learned representations can be transferred to various downstream tasks. We do not assume access to any labels or human supervision beyond the $(lat_i, lon_i)$ geo-tags. The quality of the representations is measured by their performance on various downstream tasks. Our primary goal is to improve the performance of self-supervised learning by utilizing the geo-coordinates and the unique temporal structure of remote sensing data.

    3.1. Functional Map of the World

Functional Map of the World (fMoW) is a large-scale, publicly available remote sensing dataset [6] consisting of approximately 363,571 training images and 53,041 test images across 62 highly granular class categories. It provides images (temporal views) from the same location over time, $(x_i^1, \cdots, x_i^{T_i})$, as well as geo-location metadata $(lat_i, lon_i)$ for each image. Fig. 4 shows the histogram of the number of temporal views in the fMoW dataset. We can see that most of the areas have multiple temporal views, where $T_i$ can range from 1 to 21, and on average there is about 2.5-3 years of difference between the images from an area. We also show examples of spatially aligned images in Fig. 2. As seen in Fig. 6, fMoW is a global dataset consisting of images from seven continents, which can be ideal for learning global remote sensing representations.

    3.2. GeoImageNet

Following [8], we extract geo-coordinates for a subset of images in ImageNet [9] using the Flickr API. More specifically, we searched for geo-tagged images in ImageNet using the Flickr API, and were able to find 543,435 images with their associated coordinates $(lat_i, lon_i)$ across 5,150 class categories. This dataset is more challenging than ImageNet-1k as it is highly imbalanced and contains about 5× more classes. In the rest of the paper, we refer to this geo-tagged subset of ImageNet as GeoImageNet.

Figure 5: Examples of GeoImageNet classes specific to a region of the world. Left shows some animals mostly found in the same region of the world, and Right shows some classes specific to certain countries. A small portion of the Koala and African Elephant pictures were captured in zoos in North America. We note that we do not project coordinates onto the world map in this figure.

Figure 6: Top shows the distribution of the fMoW dataset and Bottom shows the distribution of the GeoImageNet dataset.

Upon publication, we will release the GeoImageNet dataset publicly for the research community.

We show some examples from GeoImageNet in Fig. 3. As shown in the figure, for some images the geo-coordinates can be predicted from visual cues. For example, we see that a picture of a person with a Sombrero hat was captured in Mexico. Similarly, an Indian Elephant picture was captured in India, where there is a large population of Indian Elephants. Next to it, we show the picture of an African Elephant (which is larger in size). If a model is trained to predict where in the world an image was taken, it should be able to identify visual cues that are transferable to other tasks (e.g., visual cues to differentiate Indian Elephants from their African counterparts). Figure 6 shows the distribution of images in the GeoImageNet dataset. In Fig. 5 we show that certain animals in GeoImageNet are concentrated in specific regions of the world whereas some other classes are specific to certain countries, making geo-location prediction an ideal pre-text task for learning location-aware representations.

    4. Method

In this section, we briefly review contrastive loss functions for unsupervised learning and detail our proposed approach to improve MoCo-v2 [4], a recent contrastive learning framework, on geo-located data.


4.1. Contrastive Learning Framework

Contrastive methods [15, 3, 4, 38, 30] attempt to learn a mapping $f_q : x_i^t \mapsto z_i^t \in \mathbb{R}^d$ from raw pixels $x_i^t$ to semantically meaningful representations $z_i^t$ in an unsupervised way. The training objective encourages representations corresponding to pairs of images that are known a priori to be semantically similar (positive pairs) to be closer to each other than typical unrelated pairs (negative pairs). With similarity measured by dot product, recent approaches in contrastive learning differ in the type of contrastive loss and the generation of positive and negative pairs. In this work, we focus on the state-of-the-art contrastive learning framework MoCo-v2 [4], an improved version of MoCo [15], and study improved methods for the construction of positive and negative pairs tailored to remote sensing applications.

The contrastive loss function used in the MoCo-v2 framework is InfoNCE [30], which is defined as follows for a given data sample:

$$\mathcal{L}_z = -\log \frac{\exp(z \cdot \hat{z} / \lambda)}{\exp(z \cdot \hat{z} / \lambda) + \sum_{j=1}^{N} \exp(z \cdot k_j / \lambda)}, \qquad (1)$$

where $z$ and $\hat{z}$ are query and key representations obtained by passing the two augmented views of $x_i^t$ (denoted $v$ and $v'$ in Fig. 1) through the query and key encoders, $f_q$ and $f_k$, parameterized by $\theta_q$ and $\theta_k$ respectively. Here $z$ and $\hat{z}$ form a positive pair. The $N$ negative samples, $\{k_j\}_{j=1}^{N}$, come from a dictionary of representations built as a queue. We refer readers to [15] for details on this. $\lambda \in \mathbb{R}^{+}$ is the temperature hyperparameter.

The key idea here is to encourage representations of positive (semantically similar) pairs to be closer, and negative (semantically unrelated) pairs to be far apart, as measured by dot product. The construction of positive and negative pairs plays a crucial role in this contrastive learning framework. MoCo and MoCo-v2 both use perturbations (also called "data augmentation") of the same image to create a positive example and perturbations of different images to create a negative example. Commonly used perturbations include random color jittering, random horizontal flip, and random grayscale conversion.
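For reference, Eq. 1 translates almost directly into code. Below is a minimal PyTorch sketch of the InfoNCE loss, assuming the query batch `z`, key batch `z_hat`, and negative `queue` are already encoded; the function name and tensor shapes are our assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def info_nce(z: torch.Tensor, z_hat: torch.Tensor,
             queue: torch.Tensor, temperature: float = 0.2) -> torch.Tensor:
    """InfoNCE loss of Eq. 1 for a batch of B queries.

    z:     (B, d) query representations from f_q
    z_hat: (B, d) key representations from f_k (the positives)
    queue: (N, d) dictionary of negative keys
    """
    # Positive logits: dot product between each query and its own key.
    l_pos = torch.einsum("bd,bd->b", z, z_hat).unsqueeze(-1)  # (B, 1)
    # Negative logits: dot products against every key in the queue.
    l_neg = torch.einsum("bd,nd->bn", z, queue)               # (B, N)
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    # The positive sits at index 0 of each row, so the target class is 0.
    labels = torch.zeros(z.size(0), dtype=torch.long, device=z.device)
    return F.cross_entropy(logits, labels)
```

MoCo-v2 additionally maintains the queue with a momentum-updated key encoder; that machinery is omitted in this sketch.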

    4.1.1 Temporal Positive Pairs

Different from many commonly seen natural image datasets, remote sensing datasets often have extra temporal information, meaning that for a given location $(lat_i, lon_i)$, there exists a sequence of spatially aligned images $X_i = (x_i^1, \cdots, x_i^{T_i})$ over time. Unlike in traditional videos, where nearby frames could experience large changes in content (e.g., from a cat to a tree), in remote sensing the content is often more stable across time due to the fixed viewpoint.


Figure 7: Demonstration of temporal positives in Eq. 2. An image from an area is paired with the other images from the same area, including itself, captured at different times. We show the time stamps for each image underneath the images. We can see the color changes in the stadium seating and surrounding areas.

For instance, a place in the ocean is likely to remain ocean for months or years, in which case satellite images taken across time at the same location should share high semantic similarity. Even for locations where non-trivial changes do occur over time, certain semantic similarities could still remain. For instance, key features of a construction site are likely to remain the same even as the appearance changes due to seasonality.

Given these observations, it is natural to leverage temporal information for remote sensing while constructing positive or negative pairs, since it can provide us with extra semantically meaningful information about a place over time. More specifically, given an image $x_i^{t_1}$ collected at time $t_1$, we can randomly select another image $x_i^{t_2}$ that is spatially aligned with $x_i^{t_1}$ (i.e., $x_i^{t_2} \in X_i$). We then apply perturbations (e.g., random color jittering) as used in MoCo-v2 to the spatially aligned image pair $x_i^{t_1}$ and $x_i^{t_2}$, providing us with a temporal positive pair (denoted $v$ and $v'$ in Figure 1) that can be used for training the contrastive learning framework by passing them through the query and key encoders, $f_q$ and $f_k$ respectively (see Fig. 1). Note that when $t_1 = t_2$, the temporal positive pair is the same as the positive pair used in MoCo-v2.
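A minimal sketch of this pair construction, assuming each training sample carries its list of spatially aligned images; the perturbation pipeline below is an illustrative subset of the MoCo-v2 augmentations (the full pipeline also includes, e.g., Gaussian blur), and all names are ours.

```python
import random

import torchvision.transforms as T

# Illustrative subset of the MoCo-v2 perturbations.
augment = T.Compose([
    T.RandomResizedCrop(224),
    T.ColorJitter(0.4, 0.4, 0.4, 0.1),
    T.RandomGrayscale(p=0.2),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])


def temporal_positive_pair(images):
    """Build (v, v') from a sequence of spatially aligned images X_i.

    images: list of PIL images x_i^1, ..., x_i^{T_i} of the same location.
    When the same index is drawn twice (t1 == t2), this reduces to the
    original MoCo-v2 positive pair.
    """
    x_t1 = random.choice(images)
    x_t2 = random.choice(images)  # may coincide with x_t1
    return augment(x_t1), augment(x_t2)
```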

Given a data sample $x_i^{t_1}$, our TemporalInfoNCE objective function can be formulated as follows:

$$\mathcal{L}_{z_i^{t_1}} = -\log \frac{\exp(z_i^{t_1} \cdot z_i^{t_2} / \lambda)}{\exp(z_i^{t_1} \cdot z_i^{t_2} / \lambda) + \sum_{j=1}^{N} \exp(z_i^{t_1} \cdot k_j / \lambda)}, \qquad (2)$$

where $z_i^{t_1}$ and $z_i^{t_2}$ are the encoded representations of the randomly perturbed temporal positive pair $x_i^{t_1}$, $x_i^{t_2}$. Similar to Eq. 1, $N$ is the number of negative samples, $\{k_j\}_{j=1}^{N}$ are the encoded negative pairs, and $\lambda \in \mathbb{R}^{+}$ is the temperature hyperparameter. Again, we refer readers to [15] for details on the construction of these negative pairs.

Note that compared to Eq. 1, we use two real images from the same area over time to create positive pairs.


We believe that relying on real images for positive pairs encourages the network to learn better representations for real data than one relying on synthetic images. On the other hand, our objective in Eq. 2 enforces the representations to be invariant to changes over time. Depending on the target task, such an inductive bias can be desirable or undesirable. For example, for a change detection task, learning representations that are highly sensitive to temporal changes may be preferable. On the other hand, for image classification or object detection tasks, learning temporally invariant features should not degrade the downstream performance. In our experiments, we additionally evaluate temporal data classification to analyze the effect of using temporal positive pairs.

    4.2. Geo-location Classification as a Pre-text Task

In this section, we explore another aspect of remote sensing images, geo-location metadata, to further improve the quality of the learned representations. Geo-location metadata has been previously explored in different studies to improve image recognition accuracy [37, 23, 7, 20, 6] and image retrieval [11], and to predict the geo-location of images missing metadata [13, 14]. In this study, we utilize geo-location metadata for unsupervised learning.

In this direction, we design a pre-text task for unsupervised learning. In our pre-text task, we cluster the images in the dataset using their coordinates $(lat_i, lon_i)$. We use a clustering method to construct $K$ clusters and assign an area with coordinates $(lat_i, lon_i)$ a categorical geo-label $c_i \in C = \{1, \cdots, K\}$. We then optimize a geo-location predictor network $f_c$ using the cross-entropy loss function

$$\mathcal{L}_g = -\sum_{k=1}^{K} p(c_i = k) \log \hat{p}(c_i = k \mid f_c(x_i^t)), \qquad (3)$$

where $p$ represents the probability of cluster $k$ being the true cluster and $\hat{p}$ represents the predicted probabilities for the $K$ clusters. In our experiments, we represent $f_c$ with a CNN parameterized by $\theta_c$. With this objective, our goal is to learn location-aware representations that can potentially transfer well to different downstream tasks.
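As a sketch of the geo-label construction: the experiments in Sec. 5 use k-means with K = 100 on the raw coordinate pairs. Treating (lat, lon) as Euclidean points, as below, is a simplifying assumption of this sketch, and the function name is ours.

```python
import numpy as np
from sklearn.cluster import KMeans


def assign_geo_labels(coords: np.ndarray, k: int = 100) -> np.ndarray:
    """Cluster (lat_i, lon_i) pairs into K geo-clusters.

    coords: (N, 2) array of coordinates; returns categorical geo-labels
    c_i in {0, ..., K-1} used as the pre-text targets of Eq. 3.
    """
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
    return kmeans.fit_predict(coords)


# Example: 1,000 random coordinates mapped to 100 geo-labels.
rng = np.random.default_rng(0)
coords = np.stack([rng.uniform(-90, 90, 1000),
                   rng.uniform(-180, 180, 1000)], axis=1)
geo_labels = assign_geo_labels(coords)
```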

4.3. Combining Geo-location Classification and Contrastive Learning Loss

In the previous section, we designed a pre-text task leveraging the geo-location metadata of the images to learn location-aware representations in a standalone setting. In this section, we combine the geo-location prediction and contrastive learning tasks in a single objective to improve over the contrastive-learning-only and geo-location-learning-only tasks. In this direction, we first integrate the geo-location classification task into the contrastive learning framework using the cross-entropy loss function, where the geo-location prediction network uses features $z_i^t$ from the query encoder:

$$\mathcal{L}_g = -\sum_{k=1}^{K} p(c_i = k) \log \hat{p}(c_i = k \mid f_c(z_i^t)). \qquad (4)$$

In the combined framework (see Fig. 1), $f_c$ is represented by a linear layer parameterized by $\theta_c$. Finally, we propose the objective for joint learning as a linear combination of the TemporalInfoNCE and geo-classification losses, with coefficients $\alpha$ and $\beta$ representing the importance of the contrastive learning and geo-location learning losses:

$$\underset{\theta_q, \theta_k, \theta_c}{\arg\min} \; \mathcal{L}_f = \alpha \mathcal{L}_{z^{t_1}} + \beta \mathcal{L}_g. \qquad (5)$$

The objective is jointly optimized as a function of the query and key encoder parameters $\theta_q$, $\theta_k$, and $\theta_c$. By combining the two tasks, we learn representations that jointly maximize agreement between spatio-temporal positive pairs, minimize agreement between negative pairs, and predict the geo-label of the images from the positive pairs.
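A minimal sketch of the combined objective of Eq. 5, assuming the temporal positive representations and the negative queue are already computed and `geo_head` is the linear layer f_c; the function name and the inlined contrastive term are our assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F


def combined_loss(z_t1, z_t2, queue, geo_head, geo_labels,
                  alpha=1.0, beta=1.0, temperature=0.2):
    """L_f = alpha * TemporalInfoNCE (Eq. 2) + beta * geo loss (Eq. 4).

    z_t1, z_t2: (B, d) representations of the temporal positive pair
    queue:      (N, d) negative keys
    geo_head:   linear layer f_c mapping d-dim features to K geo-logits
    geo_labels: (B,) k-means cluster indices c_i
    """
    # TemporalInfoNCE: z_t1 queries vs. z_t2 positives and queued negatives.
    l_pos = torch.einsum("bd,bd->b", z_t1, z_t2).unsqueeze(-1)   # (B, 1)
    l_neg = torch.einsum("bd,nd->bn", z_t1, queue)               # (B, N)
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    targets = torch.zeros(z_t1.size(0), dtype=torch.long, device=z_t1.device)
    loss_contrastive = F.cross_entropy(logits, targets)

    # Geo-location term: predict the geo-cluster from the query features.
    loss_geo = F.cross_entropy(geo_head(z_t1), geo_labels)

    return alpha * loss_contrastive + beta * loss_geo
```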

5. Experiments

In this study, we perform representation learning experiments on the fMoW and GeoImageNet datasets. We then evaluate the learned representations on a variety of downstream tasks, including image recognition, object detection and semantic segmentation benchmarks, on remote sensing and conventional images.

Figure 8: Left shows the number of clusters per label and Right shows the number of unique labels per cluster in fMoW and GeoImageNet. Labels represent the original classes in fMoW and GeoImageNet.

Implementation Details for Unsupervised Learning. For contrastive learning, similar to MoCo-v2 [4], we use ResNet-50 to parameterize the query and key encoders, $f_q$ and $f_k$, in all experiments. We use the following hyperparameters in the MoCo-v2 pre-training step: a learning rate of 1e-3, a batch size of 256, a dictionary queue of size 65,536, a temperature of 0.2, and the SGD optimizer. We use similar setups for both the fMoW and GeoImageNet datasets. Finally, for each downstream experiment, we report results for the representations learned after 100 and/or 200 epochs.

For the geo-location classification task, we run the k-means clustering algorithm to cluster fMoW and GeoImageNet into K = 100 geo-clusters given their latitude and longitude pairs. We show the clusters in Fig. 9. As seen in the figure, while both datasets have similar clusters, there are some differences, particularly in North America and Europe. In Fig. 8 we analyze the clusters in GeoImageNet and fMoW. The figure shows that the number of clusters per class on GeoImageNet tends to be skewed towards smaller numbers than on fMoW, whereas the number of unique classes per cluster on GeoImageNet has more of a uniform distribution. For fMoW, we can conclude that each cluster contains samples from most of the classes. Finally, when adding the geo-location classification task to contrastive learning, we set α and β to 1.

Methods. We compare our unsupervised learning approach to supervised learning on the image recognition task. For object detection and semantic segmentation, we compare against pre-trained weights obtained using (a) supervised learning and (b) random initialization, fine-tuning on the target task dataset. Finally, for ablation analysis, we provide results using different combinations of our methods. When adding only the geo-location classification task or only temporal positives to MoCo-v2, we use MoCo-v2+Geo and MoCo-v2+TP, respectively.

Figure 9: Top shows the distribution of the fMoW clusters and Bottom shows the distribution of the GeoImageNet clusters.

When adding both of our approaches to MoCo-v2, we use MoCo-v2+Geo+TP.

    5.1. Experiments on fMoW

We first perform experiments on the fMoW dataset. Similar to [4], we evaluate the learned representations for remote sensing images by training a linear classification layer with supervised learning.
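A sketch of this linear evaluation protocol: the pre-trained query encoder is frozen and only a linear classifier is trained on its 2048-d features. Loading the MoCo-v2 weights is omitted, and the optimizer settings here are illustrative assumptions, not the paper's.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# Pre-trained query encoder; in practice the MoCo-v2 weights would be
# loaded here before freezing (loading is omitted in this sketch).
backbone = resnet50()
backbone.fc = nn.Identity()      # expose the 2048-d pooled features
for p in backbone.parameters():  # freeze the encoder
    p.requires_grad = False
backbone.eval()

linear = nn.Linear(2048, 62)     # fMoW has 62 class categories
optimizer = torch.optim.SGD(linear.parameters(), lr=0.01, momentum=0.9)


def linear_eval_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One supervised step on frozen features (linear evaluation)."""
    with torch.no_grad():
        feats = backbone(images)  # (B, 2048), no gradients
    loss = nn.functional.cross_entropy(linear(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```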

Method             Backbone   Accuracy ↑ (100 Epochs)   Accuracy ↑ (200 Epochs)
Sup. Learning*     ResNet50   69.05                     69.05
Geoloc. Learning*  ResNet50   52.40                     52.40
MoCo-V2            ResNet50   58.32                     60.69
MoCo-V2+Geo        ResNet50   63.65                     64.07
MoCo-V2+TP         ResNet50   67.15                     68.32
MoCo-V2+Geo+TP     ResNet50   65.77                     66.33

Table 1: Experiments on fMoW on classifying single images. * indicates a model trained up to the epoch with the highest accuracy on the validation set. We use the same setup for Sup. Learning and Geoloc. Learning in the remaining experiments.

Classifying Single Images. In Table 1, we report the results on single image classification on fMoW. We would like to highlight that in this case we classify each image individually. In other words, we do not use the prior information that multiple images over the same area $(x_i^1, x_i^2, \ldots, x_i^{T_i})$ have the same labels $(y_i, y_i, \ldots, y_i)$. For evaluation, we use 53,041 images. Our results on this task show that MoCo-v2 performs reasonably well on a large-scale dataset, with 60.69% accuracy, 8% less than the supervised learning method. This result aligns with MoCo-v2's performance on the ImageNet dataset [4]. Next, by incorporating the geo-location classification task into MoCo-v2, we improve top-1 classification accuracy by 3.38%. We further improve the results to 68.32% using temporal positives, bridging the gap between the MoCo-v2 baseline and supervised learning to less than 1%.

Classifying Temporal Data. In the next step, we change how we perform testing across multiple images of an area taken at different times. In this case, we predict labels from images over an area, i.e., we make a prediction for each $t \in \{1, \ldots, T_i\}$ and average the predictions from that area. We then use the most confident class prediction to get area-specific class predictions. Here, we evaluate the performance on 11,231 unique areas that are represented by multiple images at different times. Our results in Table 2 show that area-specific inference improves classification accuracy by 4-8% over image-specific inference. Even when incorporating temporal positives, we can improve the accuracy by 6.1% by switching from single image classification to temporal data classification. Overall, our methods outperform the baseline MoCo-v2 by 4-6% and supervised learning by 1.2%.
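A minimal sketch of this area-level inference, assuming `probs` stacks the per-image softmax outputs for one area (the function name is ours):

```python
import torch


def area_prediction(probs: torch.Tensor) -> int:
    """Area-level inference over temporal views.

    probs: (T_i, C) per-image softmax outputs for one area.
    Averages the per-image predictions over the T_i views and
    returns the most confident class as the area's prediction.
    """
    mean_probs = probs.mean(dim=0)   # average over temporal views
    return int(mean_probs.argmax())  # most confident class
```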

Method             Backbone   Accuracy ↑ (100 Epochs)   Accuracy ↑ (200 Epochs)
Sup. Learning*     ResNet50   73.24 (+4.19)             73.24 (+4.19)
Geoloc. Learning*  ResNet50   56.12 (+3.72)             56.12 (+3.72)
MoCo-V2            ResNet50   62.45 (+4.13)             68.64 (+7.95)
MoCo-V2+Geo        ResNet50   67.11 (+3.46)             70.48 (+6.41)
MoCo-V2+TP         ResNet50   68.04 (+0.89)             74.42 (+6.10)
MoCo-V2+Geo+TP     ResNet50   67.58 (+1.73)             72.76 (+6.43)

Table 2: Experiments on fMoW on classifying temporal data. Numbers in parentheses show the improvement over the corresponding single image classification results in Table 1.

    5.2. Transfer Learning Experiments

Previously, we performed pre-training experiments on the fMoW dataset and quantified the quality of the representations by training a linear layer for image recognition on fMoW using supervised learning. In this section, we perform transfer learning experiments on different datasets.

    5.2.1 Object Detection

For object detection, we use the xView dataset [18], consisting of high resolution satellite images captured with sensors similar to those used in the fMoW dataset. The xView dataset consists of 846 very large (∼2000×2000 pixels) satellite images with bounding box annotations for 60 class categories, including airplane, passenger vehicle, maritime vessel, helicopter, etc.

Implementation Details. We first divide the set of large images into 700 training and 146 test images. Then, we process the large images to create 416×416 pixel images by randomly sampling the bounding box coordinates of the small image, repeating this process 100 times for each large image. In this process, we ensure that there is less than 25% overlap between any two bounding boxes from the same image. We then use RetinaNet [21] with a pre-trained ResNet-50 backbone and fine-tune the full network on the xView training set. To train RetinaNet, we use a learning rate of 1e-5, a batch size of 4, and the Adam optimizer to learn the weights.
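A sketch of this tiling step under our reading of the text: crop windows are sampled at random, and a new window is kept only if its overlap with every previously kept window is below 25% (measuring overlap as IoU is our assumption; helper names are ours).

```python
import random


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def sample_crops(width, height, crop=416, n=100, max_overlap=0.25):
    """Sample up to n crop windows from one large image, each with
    less than max_overlap IoU against every previously kept window."""
    boxes = []
    attempts = 0
    while len(boxes) < n and attempts < 20 * n:
        attempts += 1
        x = random.randint(0, width - crop)
        y = random.randint(0, height - crop)
        box = (x, y, x + crop, y + crop)
        if all(box_iou(box, other) < max_overlap for other in boxes):
            boxes.append(box)
    return boxes


# Example: tile one ~2000x2000 xView image into 416x416 crops.
crops = sample_crops(2000, 2000, crop=416, n=100)
```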

Quantitative Analysis. Table 3 shows the object detection performance on the xView test set. We achieve the best results with the addition of both temporal positive pairs and the geo-location classification pre-text task to MoCo-v2. With our final model, we outperform the randomly initialized weights by 7% AP and supervised learning on fMoW by 3.3% AP.

pre-train          AP50 ↑
Random Init.       10.75
Sup. Learning      14.42
MoCo-V2            15.45 (+4.70)
MoCo-V2-Geo        15.63 (+4.88)
MoCo-V2-TP         17.65 (+6.90)
MoCo-V2-Geo+TP     17.74 (+6.99)

Table 3: Object detection results on the xView dataset.

    5.2.2 Image Segmentation

In this section, we perform downstream experiments on the task of semantic segmentation on the SpaceNet dataset [43]. The SpaceNet dataset consists of 5,000 high resolution satellite images with segmentation masks for buildings.

Implementation Details. We divide the SpaceNet dataset into training and test sets of 4,000 and 1,000 images, respectively. We use the PSANet [52] network with a ResNet-50 backbone to perform semantic segmentation. We train the PSANet network with a batch size of 16 and a learning rate of 0.01 for 100 epochs. In all segmentation experiments, we use the SGD optimizer.

Quantitative Analysis. Table 4 shows the segmentation performance of differently initialized backbone weights on the SpaceNet test set. Similar to object detection, we achieve the best IoU scores with the addition of temporal positives and the geo-location classification task. Our final model outperforms the randomly initialized weights and supervised learning by 3.58% and 2.94% mIoU, respectively. We observe that the gap between the best and worst models shrinks going from image recognition to object detection and semantic segmentation. This aligns with the performance of MoCo-v2 pre-trained on ImageNet and fine-tuned on the Pascal-VOC object detection and semantic segmentation experiments [15, 4].

pre-train          mIoU ↑
Random Init.       74.93
Sup. Learning      75.57
MoCo-V2            78.05 (+3.12)
MoCo-V2-Geo        78.42 (+3.49)
MoCo-V2-TP         78.48 (+3.55)
MoCo-V2-Geo+TP     78.51 (+3.58)

Table 4: Semantic segmentation experiments on the SpaceNet dataset.

    5.2.3 Land Cover Classification

Finally, we perform transfer learning experiments on land cover classification across 66 land cover classes using high resolution remote sensing images obtained by the USDA's National Agricultural Imagery Program (NAIP). We use images from California's Central Valley for the year 2016. Our final dataset consists of 100,000 training and 50,000 test images.


pre-train          Top-1 Accuracy ↑
Random Init.       51.89
Sup. Learning      54.46
MoCo-V2            55.18 (+3.29)
MoCo-V2-Geo        58.23 (+6.34)
MoCo-V2-TP         57.10 (+5.21)
MoCo-V2-Geo+TP     57.63 (+5.74)

Table 5: Land cover classification on the NAIP dataset.

Table 5 shows that our method outperforms the randomly initialized weights by 6.34% and supervised learning by 3.77%.

    5.3. Experiments on GeoImageNet

After fMoW, we adapt the unsupervised learning methods we used on fMoW to improve representation learning on GeoImageNet. Unfortunately, since ImageNet does not contain images from the same area over time, we are not able to integrate temporal positive pairs into the MoCo-v2 objective. However, in our GeoImageNet experiments we show that we can improve MoCo-v2 by introducing the geo-location classification pre-text task.

Quantitative Analysis. Table 6 shows the top-1 and top-5 classification accuracies on the test set of GeoImageNet. Surprisingly, with only the geo-location classification task we can achieve 22.26% top-1 accuracy. With the MoCo-v2 baseline, we get 38.51% top-1 accuracy, about 3.47% more than the supervised learning method. With the addition of geo-location classification, we can further improve the top-1 accuracy by 1.45%. These results are interesting in that MoCo-v2 (200 epochs) performs 8% worse than supervised learning on ImageNet-1k, whereas it outperforms supervised learning on our highly imbalanced GeoImageNet with 5,150 class categories, about 5× more than ImageNet-1k.

Method            Backbone   Top-1 Accuracy ↑   Top-5 Accuracy ↑
Sup. Learning     ResNet50   35.04              54.11
Geoloc. Learning  ResNet50   22.26              39.33
MoCo-V2           ResNet50   38.51              57.67
MoCo-V2+Geo       ResNet50   39.96              58.71

Table 6: Experiments on GeoImageNet. We train MoCo-V2 and MoCo-V2+Geo for 200 epochs, whereas Sup. and Geoloc. Learning are trained until they converge. We divide the dataset into 443,435 training and 100,000 test images across 5,150 classes.

6. Conclusion

In this work, we provide a self-supervised learning framework for remote sensing data, where unlabeled data is often plentiful but labeled data is scarce. By leveraging spatially aligned images over time to construct temporal positive pairs in contrastive learning and geo-location in the design of pre-text tasks, we are able to close the gap between self-supervised and supervised learning on image classification, object detection and semantic segmentation on remote sensing and other geo-tagged image datasets.

References

[1] Kumar Ayush, Burak Uzkent, Marshall Burke, David Lobell, and Stefano Ermon. Efficient poverty mapping using deep reinforcement learning. arXiv preprint arXiv:2006.04224, 2020.
[2] Kumar Ayush, Burak Uzkent, Marshall Burke, David Lobell, and Stefano Ermon. Generating interpretable poverty maps using object detection in satellite images. arXiv preprint arXiv:2002.01612, 2020.
[3] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.
[4] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020.
[5] Anil M Cheriyadat. Unsupervised feature learning for aerial scene classification. IEEE Transactions on Geoscience and Remote Sensing, 52(1):439–451, 2013.
[6] Gordon Christie, Neil Fendley, James Wilson, and Ryan Mukherjee. Functional map of the world. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6172–6180, 2018.
[7] Grace Chu, Brian Potetz, Weijun Wang, Andrew Howard, Yang Song, Fernando Brucher, Thomas Leung, and Hartwig Adam. Geo-aware networks for fine-grained recognition. In Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019.
[8] Terrance de Vries, Ishan Misra, Changhan Wang, and Laurens van der Maaten. Does object recognition work for everyone? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 52–59, 2019.
[9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[10] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.
[11] Raul Gomez, Jaume Gibert, Lluis Gomez, and Dimosthenis Karatzas. Location sensitive image retrieval and tagging. In European Conference on Computer Vision, pages 649–665. Springer, 2020.
[12] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems, 33, 2020.
[13] James Hays and Alexei A Efros. IM2GPS: Estimating geographic information from a single image. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2008.
[14] James Hays and Alexei A Efros. Large-scale image geolocalization. In Multimodal Location Estimation of Videos and Images, pages 41–62. Springer, 2015.
[15] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020.
[16] Hayate Iso, Shoko Wakamiya, and Eiji Aramaki. Density estimation for geolocation via convolutional mixture density network. arXiv preprint arXiv:1705.02750, 2017.
[17] Neal Jean, Sherrie Wang, Anshul Samar, George Azzari, David Lobell, and Stefano Ermon. Tile2Vec: Unsupervised representation learning for spatially distributed data. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3967–3974, 2019.
[18] Darius Lam, Richard Kuzma, Kevin McGee, Samuel Dooley, Michael Laielli, Matthew Klaric, Yaroslav Bulatov, and Brendan McCord. xView: Objects in context in overhead imagery. arXiv preprint arXiv:1802.07856, 2018.
[19] Yansheng Li, Chao Tao, Yihua Tan, Ke Shang, and Jinwen Tian. Unsupervised multilayer feature learning for satellite image scene classification. IEEE Geoscience and Remote Sensing Letters, 13(2):157–161, 2016.
[20] Dahua Lin, Ashish Kapoor, Gang Hua, and Simon Baker. Joint people, event, and location recognition in personal photo collections using cross-domain context. In European Conference on Computer Vision, pages 243–256. Springer, 2010.
[21] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.
[22] Xiaoqiang Lu, Xiangtao Zheng, and Yuan Yuan. Remote sensing scene classification by unsupervised representation learning. IEEE Transactions on Geoscience and Remote Sensing, 55(9):5148–5157, 2017.
[23] Oisin Mac Aodha, Elijah Cole, and Pietro Perona. Presence-only geographical priors for fine-grained image classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 9596–9606, 2019.
[24] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV), pages 181–196, 2018.
[25] Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6707–6717, 2020.
[26] Lichao Mou, Pedram Ghamisi, and Xiao Xiang Zhu. Unsupervised spectral–spatial feature learning via deep residual conv–deconv network for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 56(1):391–406, 2017.
[27] Eric Muller-Budack, Kader Pustu-Iren, and Ralph Ewerth. Geolocation estimation of photos using a hierarchical model and scene classification. In Proceedings of the European Conference on Computer Vision (ECCV), pages 563–579, 2018.
[28] T Nathan Mundhenk, Goran Konjevod, Wesam A Sakla, and Kofi Boakye. A large contextual dataset for classification, detection and counting of cars with deep learning. In European Conference on Computer Vision, pages 785–800. Springer, 2016.
[29] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pages 69–84. Springer, 2016.
[30] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
[31] Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell, and Bharath Hariharan. Learning features by watching objects move. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2701–2710, 2017.
[32] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2536–2544, 2016.
[33] Adriana Romero, Carlo Gatta, and Gustau Camps-Valls. Unsupervised deep feature extraction for remote sensing image classification. IEEE Transactions on Geoscience and Remote Sensing, 54(3):1349–1362, 2015.
[34] Evan Sheehan, Chenlin Meng, Matthew Tan, Burak Uzkent, Neal Jean, Marshall Burke, David Lobell, and Stefano Ermon. Predicting economic development using geolocated Wikipedia articles. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2698–2706, 2019.
[35] Evan Sheehan, Burak Uzkent, Chenlin Meng, Zhongyi Tang, Marshall Burke, David Lobell, and Stefano Ermon. Learning to interpret satellite images using Wikipedia. arXiv preprint arXiv:1809.10236, 2018.
[36] Kihyuk Sohn, David Berthelot, Chun-Liang Li, Zizhao Zhang, Nicholas Carlini, Ekin D Cubuk, Alex Kurakin, Han Zhang, and Colin Raffel. FixMatch: Simplifying semi-supervised learning with consistency and confidence. arXiv preprint arXiv:2001.07685, 2020.
[37] Kevin Tang, Manohar Paluri, Li Fei-Fei, Rob Fergus, and Lubomir Bourdev. Improving image classification with location context. In Proceedings of the IEEE International Conference on Computer Vision, pages 1008–1016, 2015.
[38] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. arXiv preprint arXiv:1906.05849, 2019.
[39] Burak Uzkent, Aneesh Rangnekar, and Matthew J Hoffman. Tracking in aerial hyperspectral videos using deep kernelized correlation filters. IEEE Transactions on Geoscience and Remote Sensing, 57(1):449–461, 2018.
[40] Burak Uzkent, Evan Sheehan, Chenlin Meng, Zhongyi Tang, Marshall Burke, David Lobell, and Stefano Ermon. Learning to interpret satellite images in global scale using Wikipedia. arXiv preprint arXiv:1905.02506, 2019.
[41] Burak Uzkent, Evan Sheehan, Chenlin Meng, Zhongyi Tang, Marshall Burke, David B Lobell, and Stefano Ermon. Learning to interpret satellite images using Wikipedia. In IJCAI, pages 3620–3626, 2019.
[42] Burak Uzkent, Christopher Yeh, and Stefano Ermon. Efficient object detection in large images using deep reinforcement learning. In The IEEE Winter Conference on Applications of Computer Vision, pages 1824–1833, 2020.
[43] Adam Van Etten, Dave Lindenbaum, and Todd M Bacastow. SpaceNet: A remote sensing dataset and challenge series. arXiv preprint arXiv:1807.01232, 2018.
[44] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096–1103, 2008.
[45] Nam Vo, Nathan Jacobs, and James Hays. Revisiting IM2GPS in the deep learning era. In Proceedings of the IEEE International Conference on Computer Vision, pages 2621–2630, 2017.
[46] Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 2794–2802, 2015.
[47] Tobias Weyand, Ilya Kostrikov, and James Philbin. PlaNet: Photo geolocation with convolutional neural networks. In European Conference on Computer Vision, pages 37–55. Springer, 2016.
[48] Yi Yang and Shawn Newsam. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 270–279, 2010.
[49] Lefei Zhang, Liangpei Zhang, Bo Du, Jane You, and Dacheng Tao. Hyperspectral image unsupervised classification by robust manifold matrix factorization. Information Sciences, 485:154–169, 2019.
[50] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In European Conference on Computer Vision, pages 649–666. Springer, 2016.
[51] Richard Zhang, Phillip Isola, and Alexei A Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1058–1067, 2017.
[52] Hengshuang Zhao, Yi Zhang, Shu Liu, Jianping Shi, Chen Change Loy, Dahua Lin, and Jiaya Jia. PSANet: Point-wise spatial attention network for scene parsing. In Proceedings of the European Conference on Computer Vision (ECCV), pages 267–283, 2018.