
Measures of Complexity for Large Scale Image Datasets

Ameet Annasaheb Rahane §
University of California, Berkeley
Berkeley, CA
[email protected]

Anbumani Subramanian
Intel
Bangalore, India
[email protected]

Abstract—Large scale image datasets are a growing trend in the field of machine learning. However, it is hard to quantitatively understand or specify how various datasets compare to each other - i.e., if one dataset is more complex or harder to "learn" with respect to a deep-learning based network. In this work, we build a series of relatively computationally simple methods to measure the complexity of a dataset. Furthermore, we present an approach to demonstrate visualizations of high dimensional data, in order to assist with visual comparison of datasets. We present our analysis using four datasets from the autonomous driving research community - Cityscapes, IDD, BDD and Vistas. Using entropy based metrics, we present a rank-order complexity of these datasets, which we compare with an established rank-order with respect to deep learning.

Index Terms—Machine learning, deep learning, manifold learning, intrinsic dimensionality, learning theory

I. INTRODUCTION

With an ever-increasing use and need for large image datasets in machine learning, it is important to see how datasets differ in terms of complexity and distribution. When comparing large scale image datasets, the common practice is to evaluate metrics like the total number of images, the number of images in each class, or the class distributions in the dataset. However, all of these are metrics defined by humans and do not really provide any insight into the underlying distribution of the data.

Our objective in this work is to enable the comparison of the unconditioned distribution of data between any two datasets. In other words, our interest is in analyzing the comparative complexity of data across different image datasets, and in whether this aspect of complexity impacts the ability of a neural network to learn from a dataset. Towards this, we utilize concepts from statistics, topology, and information theory to capture the essence of data dimensionality without any knowledge of the underlying data labels or classes in the dataset. We focus on three main factors and relate these metrics to a network learning scheme. Following are the main contributions of our work:

(i) A study of the distribution of entropy for a given dataset

(ii) Estimation of the intrinsic dimensionality of a given dataset, to learn about the dimensions of the underlying data manifold

(iii) Use of Uniform Manifold Approximation and Projection (UMAP) [6] to transform a given dataset, and thus visually compare the transformations

§ This work was done as an intern at Intel.

In this work, our focus is on datasets widely used in the autonomous driving community, where semantic segmentation of images is a problem of high interest. Towards this problem, a number of new image datasets have been released in recent years. These datasets are targeted for research in object detection and semantic segmentation of images from a vehicle on roads. Good performance on semantic segmentation of image datasets informs the feasibility of building perception algorithms, a critical building block for future self-driving, autonomous vehicles. Therefore, analyzing the complexity of datasets offers a glimpse of how "hard" it is to learn from one dataset compared to others, and hence provides insights into building generalizable agents for the problem of perception in self-driving. Our objective is to build, test, and demonstrate a comparison of the complexity and distribution of data in large volume, high dimensional image datasets.

II. DATASETS

In this work, we use four well known datasets in the field of autonomous driving: Cityscapes (CS) [1], India Driving Dataset (IDD) [4], Berkeley Deep Drive (BDD) [2], and Mapillary Vistas (Vistas) [3]. We chose these datasets specifically to study and compare whether the nature of a dataset has any influence on data complexity and distribution. CS, BDD, and Vistas contain images from structured driving conditions, while IDD was designed with images from unstructured driving conditions - hence our interest here.

III. METHODS

A. Shannon Entropy

In our work, we measure entropy in three different ways. The standard Shannon entropy of a grayscale image is defined as:

H = -\sum_{i=0}^{n-1} p_i \log p_i    (1)

where n is the number of gray levels and p_i is the probability of a pixel having a given gray level i.
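For concreteness, a minimal sketch of Equation 1 in Python follows. It assumes 8-bit grayscale images (n = 256) and log base 2; the paper does not state the log base, so that choice is an assumption.

import numpy as np

def shannon_entropy(image: np.ndarray, n_levels: int = 256) -> float:
    # Histogram of gray levels; p[i] is the probability of gray level i.
    hist, _ = np.histogram(image, bins=n_levels, range=(0, n_levels))
    p = hist / hist.sum()
    p = p[p > 0]  # drop empty bins so the log is defined
    return float(-np.sum(p * np.log2(p)))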


B. GLCM

The gray level co-occurrence matrix (GLCM) of an image underlies another commonly used entropy measure. The GLCM is a histogram of co-occurring grayscale values at a given offset over an image. The GLCM helps to characterize the texture of an image, beyond image properties like energy and homogeneity. Here, the GLCM entropy is calculated as:

H_g = -\sum_{i=0}^{n-1} \sum_{j=0}^{n-1} p(i, j) \log p(i, j)    (2)

where n is the number of gray levels and p(i, j) is the probability of two pixels having intensities i and j, separated by the specified offset.
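As an illustration, the sketch below computes Equation 2 with scikit-image. The offset (distance 1, angle 0) and the use of a symmetric, normalized GLCM are assumptions, since the paper does not specify which offset it used.

import numpy as np
from skimage.feature import graycomatrix  # skimage >= 0.19

def glcm_entropy(image: np.ndarray) -> float:
    # Expects a uint8 grayscale image; one offset: 1 pixel to the right.
    glcm = graycomatrix(image, distances=[1], angles=[0],
                        levels=256, symmetric=True, normed=True)
    p = glcm[:, :, 0, 0]  # p(i, j) for the chosen offset
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))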

C. Delentropy

Delentropy is a measure based on a computable probability density function called deldensity [7]. This density distribution is capable of capturing the underlying structure using spatial image and pixel co-occurrence. This is possible because the scalar image pixel values are non-locally related to the entire gradient vector field, so using gradient vectors in the calculation allows global image features to be considered in the computation of delentropy. This is in contrast to pixel level Shannon entropy, which relies entirely on individual pixel values. Delentropy is defined similarly to Equation 1, but using the deldensity distribution, which can capture more non-local information [7].
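A minimal sketch of delentropy following the construction in [7]: the joint histogram of the two gradient components serves as the deldensity. The bin count and the use of np.gradient are our assumptions; the factor of 1/2 accounting for the two gradient components follows [7].

import numpy as np

def delentropy(image: np.ndarray, bins: int = 256) -> float:
    # Gradient vector field of the image (per-pixel dy, dx).
    fy, fx = np.gradient(image.astype(float))
    # Deldensity: joint histogram of the two gradient components.
    hist, _, _ = np.histogram2d(fx.ravel(), fy.ravel(), bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-0.5 * np.sum(p * np.log2(p)))  # 1/2 factor per [7]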

D. Intrinsic Dimensionality Estimation

Intrinsic Dimensionality (ID) represents the minimal possible representation of the underlying manifold of a dataset without loss of information. In other words, it is the smallest dimension required for a deep learning model to exactly represent a dataset. If m is the intrinsic dimensionality of a dataset, then an m-dimensional manifold is required to represent the data, and any representation with fewer than m dimensions cannot cover the data without loss of information.

1) Estimator: The maximum likelihood estimator of intrinsic dimensionality is defined in [5].

Consider a dataset with n images and let X_i represent an individual sample (image) in the dataset. Let X_1, ..., X_n \in R^p be i.i.d. observations representing an embedding of a lower dimensional sample. Let each sample X_i be mapped to a point x_i on the manifold. Specifically, X_i = g(Y_i), where the Y_i are sampled from an unknown smooth density f on R^m, with some m \le p, and g(.) is a continuous and sufficiently smooth mapping function. Here, m is the intrinsic dimensionality of the dataset.

In other words, we pick a point x and assume f(x) is approximately constant within a sphere of radius R around x. Then we can treat the observations as a homogeneous Poisson process in S_x(R). Using the derivation and notation from [5], the maximum likelihood estimate of m, given k nearest neighbors, is

\hat{m}_k(x) = \left[ \frac{1}{k-1} \sum_{j=1}^{k-1} \log \frac{T_k(x)}{T_j(x)} \right]^{-1}    (3)

where T_j(x) is the distance from the point x to its j-th nearest neighbor. This estimator of intrinsic dimensionality has been shown to balance bias and variance better than existing alternatives [5].
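A sketch of Equation 3 under stated assumptions: each image is flattened (or otherwise featurized) into one row of X, and the local estimates \hat{m}_k are averaged over all points, following the usual practice in [5]; the exact averaging scheme used by [8] may differ.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def intrinsic_dim_mle(X: np.ndarray, k: int = 10) -> float:
    # Distances from each point to its k nearest neighbors (excluding itself).
    dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    T = dist[:, 1:]  # column j holds T_{j+1}(x); column 0 (self) is dropped
    # Eq. (3): inverse of the mean of log(T_k(x) / T_j(x)) over j = 1..k-1.
    log_ratio = np.log(T[:, -1:] / T[:, :-1])
    m_hat = (k - 1) / log_ratio.sum(axis=1)
    return float(m_hat.mean())  # average the local estimates over all points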

E. UMAP: Uniform Manifold Approximation and Projection

We use UMAP [6], a topology preserving dimensionality reduction technique, to create a visual representation of the embedding of a dataset, and then qualitatively examine how this embedding relates to the other metrics discussed here, as well as to available performance comparisons of deep learning methods.

1) Assumptions: The use of uniform manifold approximation and projection requires a few assumptions. UMAP assumes that the data are uniformly distributed on a Riemannian manifold, that the manifold is locally connected, and that the Riemannian metric is locally constant or can be approximated as such.

2) Embedding: UMAP provides a manifold learning technique for dimension reduction, with foundations in Riemannian geometry and algebraic topology. The first phase constructs a fuzzy topological representation using simplices. The second phase optimizes the embedding (using stochastic gradient descent) so that its fuzzy topological representation is as close as possible to the original, as measured by cross entropy. This enables UMAP to construct a topological representation of the higher dimensional data in a lower dimensional space [6].

3) Method: In our work, we define a vector space of images as follows. We define a mapping from each image to a vector in a high dimensional space, along with a distance function between any two points in that space. The simplest version of this is the space of images with n pixels each, mapped to vectors in R^n (flattening each image into a vector). However, we can create a more intuitive representation of the image space as a vector space using the gray level co-occurrence matrix (GLCM). With this feature space, the dimensionality reduction depends on the topology of texture based features, and we can assume that the conditions listed above are met.

Using additional parameters like the number of nearest neighbors in UMAP, we show how local and global structures shift differently in different datasets - i.e., when the number of nearest neighbors to a point is low, more local structure is used to create the embedding, and when it is high, higher order structure in the dataset is used [6]. A sketch of this pipeline follows.
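A hedged sketch of the pipeline described above: each image is reduced to a GLCM-based texture feature vector and the collection is embedded in R^2 with umap-learn. The quantization to 32 gray levels and the use of the flattened GLCM as the feature vector are our assumptions; the n_neighbors and min_dist values mirror those reported in Fig. 4.

import numpy as np
import umap  # pip install umap-learn
from skimage.feature import graycomatrix

def glcm_features(image: np.ndarray, levels: int = 32) -> np.ndarray:
    # Quantize an 8-bit image down to `levels` gray levels for a compact GLCM.
    q = (image.astype(np.uint16) * levels // 256).astype(np.uint8)
    glcm = graycomatrix(q, distances=[1], angles=[0],
                        levels=levels, symmetric=True, normed=True)
    return glcm[:, :, 0, 0].ravel()  # levels*levels texture feature vector

def embed_dataset(images, n_neighbors: int = 20) -> np.ndarray:
    X = np.stack([glcm_features(im) for im in images])
    reducer = umap.UMAP(n_neighbors=n_neighbors, min_dist=0.1, n_components=2)
    return reducer.fit_transform(X)  # one 2-D point per image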


F. Computation

We used the implementation of ID from [8] and performed all our experiments on an Intel Xeon™ processor. The entire computation for intrinsic dimensionality was completed well within 60 seconds each for the Cityscapes, IDD, and BDD datasets. This computation increased to over an hour for Vistas, which contains far more images than the other datasets.

IV. RESULTS

A. Entropy

For Cityscapes, IDD, BDD, and Vistas, we used standard implementations of the methods described in Sections III-A, III-B, and III-C to calculate the Shannon entropy, GLCM entropy, and delentropy of every image in each dataset. In Table II, we present the average entropy of each dataset (for each type of entropy). Figs. 1, 2, and 3 show the entropy distribution in each dataset (fitted to a normal distribution).
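As an illustration of the fitting step behind Figs. 1-3, assuming scipy and the shannon_entropy sketch from Sec. III-A: the per-image entropies of one dataset are fit to a normal distribution, whose mean and standard deviation are the kind of values reported in Table II.

import numpy as np
from scipy.stats import norm

# `images` is a list of grayscale images from one dataset (assumed loaded).
entropies = np.array([shannon_entropy(im) for im in images])
mu, sigma = norm.fit(entropies)  # maximum likelihood normal fit
print(f"mean entropy {mu:.2f}, std {sigma:.2f}")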

B. Intrinsic Dimensionality Estimation

Using the implementation from [8], we calculate the maximum likelihood estimate of intrinsic dimensionality for each dataset; the results are shown in Table I. From this table, we can infer the geometric complexity of our datasets, and also gain some interesting insights into the representations of the given datasets.

TABLE I: Intrinsic dimensionalities of the selected datasets.

Dataset      Intrinsic Dimensionality (ID)
Cityscapes   11.4
IDD          11.9
BDD          13.7
Vistas       30.1*

*: Vistas had a high estimation variance, possibly due to the large dataset size and computational restrictions.

C. UMAP: Uniform Manifold Approximation and Projection

We use UMAP to help visualize the entire datasets in our work. We find an approximation to the topology of the GLCM features of each of our datasets, and then use it to embed the four datasets in R^2.

In Fig. 4, we show a plot of each two dimensional embedding. Alongside each axis is the distribution of that dimension of the embedding. Overall, each plot can be seen as the distribution of a two-dimensional embedding of the dataset, with the marginals on each axis. We show four variations, with the number of neighbors k = 2, 20, 100, 500, to show how the datasets change from locally to globally focused embeddings.
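One way to reproduce the style of these plots, assuming seaborn and the embed_dataset sketch from Sec. III-E: a scatter of the 2-D embedding with the marginal distribution of each dimension on its axis.

import seaborn as sns
import matplotlib.pyplot as plt

emb = embed_dataset(images, n_neighbors=20)  # `images` assumed loaded
g = sns.jointplot(x=emb[:, 0], y=emb[:, 1], kind="scatter", s=5)
g.set_axis_labels("UMAP dimension 1", "UMAP dimension 2")
plt.show()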

V. OBSERVATIONS

A. Entropy

Of the four datasets used here, Vistas is considered the most difficult to learn for semantic segmentation algorithms. From the entropy plots, we observe possible supporting evidence in a quantitative manner - Vistas has the highest mean entropy across all entropies, and more so for the GLCM based entropy, implying that Vistas has relatively more complex texture features than the other three datasets.

Based on the entropy distributions, we formulate a rank order of complexity (lowest to highest) for the four datasets. This rank order is supported by the plots in Figs. 1, 2, and 3 and by Tables II and III.

TABLE II: Mean (and standard deviation) of entropies of the datasets.

Dataset      Shannon pixel entropy   GLCM (texture) entropy   Delentropy
Cityscapes   6.85 (0.29)             6.32 (1.25)              4.35 (0.35)
IDD          7.17 (0.38)             7.8 (1.27)               4.38 (0.55)
BDD          7.38 (0.4)              7.79 (1.05)              4.3 (0.68)
Vistas       7.39 (0.31)             8.12 (1.13)              5.01 (0.62)

TABLE III: Rank order of datasets with respect to average entropies.

Shannon pixel entropy   GLCM (texture) entropy   Delentropy
Cityscapes              Cityscapes               BDD
IDD                     BDD                      Cityscapes
BDD                     IDD                      IDD
Vistas                  Vistas                   Vistas

BDD has a higher pixel entropy, although its texture and more global structure entropies are lower than those of IDD. For a deep learning based semantic segmentation network, a higher pixel entropy implies that the first few layers of the network, which interpret pixel distributions, will have difficulty learning; this could manifest as issues like poor convergence of weights or longer time to converge. For the same network on IDD, the later layers would have difficulty learning. Roughly, a high Shannon pixel entropy implies that a given network will have a harder time learning in its first few layers, as these layers interpret pixel level behavior. Alternatively, high delentropy and GLCM texture entropy imply that the network will have a harder time learning in the middle or later layers, as these layers learn higher order features.

From our analysis, using an entropy-centric view, we observe that the Cityscapes dataset tends to be the lowest in complexity, aside from delentropy. Delentropy represents higher order structure in the images. We see that Cityscapes, BDD, and IDD are of similar complexity with respect to delentropy. This observation can be correlated with the fact that these three datasets were created from a similar data capture setup - dashcam-style images with a view of roads - with differences in geographies and times of capture. The Vistas dataset, in contrast, was created with images from both dashcam-style and mobile-phone cameras. This capture setup leads to a far more diverse collection of images, including any scene that looks like a road scene. Therefore, the images in the Vistas dataset contain many more texture patterns - i.e., higher order features. This leads to an increased difference between the entropy values, except for Shannon entropy.


Fig. 1: Shannon entropy distributions fit to normal distributions for all four datasets. Notably, as we move left to right from Cityscapes to Vistas, the distribution appears to become much tighter; this can also be seen in the standard deviation, which becomes smaller from Cityscapes to BDD. This implies that BDD and Vistas have fewer low entropy images than CS and IDD. As such, when a network learns these datasets, the first few layers of the network will have a harder time learning BDD and Vistas.

Fig. 2: GLCM entropy distributions fit to normal distributions for all four datasets - Cityscapes, IDD, BDD, and Vistas. In contrast to Fig. 1, the texture level entropy of the datasets does not exhibit a notable shift in standard deviation.

Fig. 3: Delentropy distributions fit to normal distributions for all four datasets. Delentropy appears to be normally distributed. This could be a feature of the deldensity calculation, but it indicates that global features do not exhibit a severe shift in entropy between the given datasets.

B. Intrinsic Dimensionality

The significance of Cityscapes and IDD having similar complexity (in Table III) can be understood as follows: these datasets have images with a similar perspective view of the road scene - possibly due to the similarity in data capture setup and camera placement on the vehicle. Moreover, Cityscapes and IDD lack variations in weather conditions, unlike BDD or Vistas. Considering these factors, the similar pixel based entropy of these two datasets is intuitive. The difference in entropy is entirely dependent on the nature of the data in a dataset - in this case, different landscapes, road patterns, and driving patterns. The object (class) labels of IDD and BDD are compatible with Cityscapes, and so have no impact on the analysis.

C. UMAP

UMAP embedding helps to visualize the projection of each of our datasets. From our visualization plots in Fig. 4, we observe that "difficult" datasets appear to be more densely distributed. The three datasets - Cityscapes, IDD, and BDD - share many similarities in distribution, with minor variations that are easier to observe at larger numbers of nearest neighbors, k (= 100, 500) (see Fig. 4).

Our observations are consistent with the premise that the distributions of data in these datasets have an approximate complexity order, as given in Table I: Cityscapes, IDD, BDD, Vistas.

D. Accuracy of deep learning networks

From the performance metrics in [4], we can observe that deep learning semantic segmentation algorithms already exhibit an order of complexity as follows: Cityscapes, IDD, BDD, Vistas. This order matches the rank-order described in our analysis, and helps conclude that pixelwise Shannon entropy, delentropy, intrinsic dimensionality, and deep learning results are related. Given the similarity of the entropy distributions, a correlation between delentropy and GLCM feature based entropy appears to exist. Using the UMAP density plots for comparison, it is difficult to infer the complexity of a dataset from the density of points in the embedding alone, although a relation does seem to exist. These require deeper study in the future.

VI. CONCLUSION

In this work, we demonstrated that well known information theoretic measures like pixelwise Shannon entropy, GLCM feature Shannon entropy, delentropy, and intrinsic dimensionality all correlate with known performance results of semantic segmentation networks trained on the same datasets. We therefore present these metrics as a way to further study the relationship between information theoretic claims about large scale image datasets and deep learning performance. This could lead to guarantees about the performance of deep learning on a given dataset or data generation technique. Additionally, we present a dimensionality reduction embedding in the context of large image datasets to show the density of data clusters as a measure of complexity.

VII. ACKNOWLEDGEMENTS

Ameet Rahane was supported by the Intel AI Student Ambassadorship program towards this publication.

REFERENCES

[1] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The Cityscapes Dataset for Semantic Urban Scene Understanding," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[2] F. Yu, W. Xian, Y. Chen, F. Liu, M. Liao, V. Madhavan, and T. Darrell, "BDD100K: A Diverse Driving Video Database with Scalable Annotation Tooling," arXiv preprint arXiv:1805.04687, 2018.

[3] G. Neuhold, T. Ollmann, S. Rota Bulò, and P. Kontschieder, "The Mapillary Vistas Dataset for Semantic Understanding of Street Scenes," in International Conference on Computer Vision (ICCV), 2017.

[4] G. Varma, A. Subramanian, A. Namboodiri, M. Chandraker, and C. Jawahar, "IDD: A Dataset for Exploring Problems of Autonomous Navigation in Unconstrained Environments," in IEEE Winter Conf. on Applications of Computer Vision (WACV), 2019.

[5] E. Levina and P. Bickel, "Maximum Likelihood Estimation of Intrinsic Dimension," in Advances in Neural Information Processing Systems 17, 2005.

[6] L. McInnes, J. Healy, and J. Melville, "UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction," unpublished, 2018.

[7] K. Larkin, "Reflections on Shannon Information: In search of a natural information-entropy for images," arXiv preprint arXiv:1609.01117, 2016.

[8] M. Cherti, "Implementation of 'Maximum Likelihood Estimation of Intrinsic Dimension'," GitHub repository, retrieved January 9, 2020, from https://gist.github.com/mehdidc/8a0bb21a31c43b0cbbdd31d75929b5e4/, May 2016.


[Fig. 4 panels: rows (a) Cityscapes, (b) IDD, (c) BDD, (d) Vistas; columns Neighbors = 2, 20, 100, 500, all with Min. Distance = 0.1]

Fig. 4: Two dimensional projection of each dataset using UMAP (Sec. V-C). The plots show the embedding (with each data point being a coordinate), with the distribution of each dimension on each axis. Many patterns can be observed as the embeddings change from local to global.