
  • DeepSketch: Fast Sketch-Based 3D Shape Retrieval

    Cecilia Zhang Weilun Sun

    Abstract

Freehand sketches are a simple but powerful tool for communication. They contain rich information to specify shapes, and thus retrieving 3D shapes from 2D sketches has received considerable attention in the fields of computer graphics and computer vision. In this project, we present a system for cross-domain similarity search that supports sketch-based 3D shape retrieval. Instead of using hand-crafted features for searching, we propose our DeepSketch neural network, built on a Siamese network, to learn features that serve as the basis for a later similarity search with K-nearest neighbors (KNN). We further analyze how individual strokes of a sketch image affect retrieval results, and visualize the features learned by our DeepSketch network.

    1 Introduction

Retrieving 3D shapes is an important problem with many applications in fields such as interior design and animation, where specific 3D models must be found for a task. Using sketches for 3D shape retrieval is an attractive idea since sketches are easy and fast to input while containing rich semantic and geometric information.

Standard sketch-based shape retrieval amounts to solving two main challenges. One is to find an optimal view for a single 3D model, and the other is to find a feature space that is representative of both the sketches and the views of the 3D models. Multiple views are rendered for each 3D model and an automatic procedure returns the most representative view for the model. However, human-input sketches often vary widely even when they depict the same object. It is therefore hard to define a best view for a 3D model, and sometimes a best view may not even exist.

To bypass the 'best view' challenge, we are inspired by [8] to use a convolutional neural network (CNN) to learn cross-domain similarities between the sketch space and the 3D model view space. More specifically, our training network is built upon two separate Siamese networks [1]: one for sketches and one for rendered 3D model views. The key idea is to define a cross-domain loss connecting the two Siamese CNNs such that features extracted from the rendered views of 3D shapes lie close to those extracted from the sketches.

We test our algorithm on SHREC13 [4], a large-scale hand-drawn sketch query dataset for querying a 3D model dataset, and demonstrate its effectiveness for 3D shape model retrieval from 2D sketches. We further analyze how individual strokes affect the retrieval results by plotting an importance map for each sketch. We also built a graphical user interface for real-time inference and 3D model retrieval. Figure 1 illustrates our end-to-end system design for sketch-based 3D shape retrieval.

    2 Related Work

3D shape retrieval has received interest from the computer vision and graphics communities over the past decade. Initial works [7] focused on retrieving 3D shape models using other 3D shapes or text keywords as input.


• Figure 1: System overview. Given an input 2D sketch query from the user, our trained Siamese network outputs a feature vector and finds matching rendered views of 3D shape models.

Recently, more attention has been paid to 3D shape retrieval from 2D sketches. The method of [5] uses initial keyword input and then refines the retrieval using an input sketch of the desired view, and the method of [3] uses image-based retrieval. [2] performs 3D shape model retrieval from 2D sketches using rendered views of the 3D models and a feature transform based on bags of visual words built from a bank of Gabor filters, and we extend this work by exploring faster retrieval algorithms as well as better methods of clustering examples for computing the bag-of-visual-words features. Most recently, convolutional neural networks have been used [8] as feature extractors to perform 3D shape retrieval from 2D sketches.

    3 Learning feature representations

    3.1 Network Architecture

Siamese CNNs have demonstrated great success in weakly supervised metric learning [1]. The network takes pairs of data that usually carry binary labels. The loss function is defined over the pairs as

$L(d, S) = \frac{1}{2N} \sum_{n=1}^{N} \left[ S\, d^{2} + (1 - S) \max(\mathrm{margin} - d,\, 0)^{2} \right],$

where $d = \| y_1 - y_2 \|_2$ is the L2 distance between the two features of the input pair.
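As a concrete reference, the loss above can be expressed in a few lines of code. The sketch below is a minimal NumPy version written for illustration; the function name and the default margin value are our assumptions, not part of the released code.

```python
import numpy as np

def contrastive_loss(y1, y2, labels, margin=1.0):
    """Contrastive loss over a batch of N feature pairs.

    y1, y2 : (N, D) feature vectors of each pair
    labels : (N,) 1 for similar pairs, 0 for dissimilar pairs
    """
    d = np.linalg.norm(y1 - y2, axis=1)                   # L2 distance per pair
    pos = labels * d ** 2                                  # penalize distance between similar pairs
    neg = (1 - labels) * np.maximum(margin - d, 0) ** 2    # push dissimilar pairs beyond the margin
    return np.mean(pos + neg) / 2.0                        # equals 1/(2N) * sum over the batch
```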

We have two domains in our task, the sketch domain and the rendered-view domain. We trained two separate Siamese CNNs for the two domains, with a contrastive loss connecting the two networks. We used two separate networks to learn features from sketches and views because we want to enforce optimal learned filters for each domain. Replacing the two separate Siamese CNNs with two standard classification networks (e.g., AlexNet) is also possible, but training both sketch and view images with a single AlexNet performs poorly, since the filters for the two domains are not fully shared.

An illustration of our network architecture is shown in Figure 2. On the left, we show an illustration of the sketch and view domains before and after learning. Different geometric shapes correspond to different view information, and different colors correspond to different categories. Note that during training we did not specify any view similarity within a class (e.g., the object could be facing backward or forward), and thus after training, images of different views will still be clustered together as long as they belong to the same class. We will also show results of clustering learned features within a class; it appears that even though we did not include constraints on views, the learned features carry some view information. On the right, we show our network architecture. All three losses, the sketch loss, the view loss, and the cross-domain loss, are contrastive losses. The input image size is 128 by 128. We set the length of each feature vector to 64. Weights are shared within the sketch Siamese network and within the view Siamese network, but not between the two.
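To make the overall objective concrete, the following sketch (our illustrative reconstruction, not the released Caffe network definition) shows one plausible way to combine the three contrastive losses, reusing the contrastive_loss function above. Which sketch/view features are paired in the cross-domain term and the equal weighting of the three terms are assumptions.

```python
def deepsketch_loss(sk1, sk2, v1, v2, same_class, margin=1.0):
    """Combine the sketch, view, and cross-domain contrastive losses.

    sk1, sk2   : (N, 64) features of a sketch pair from the sketch branch
    v1, v2     : (N, 64) features of a view pair from the view branch
    same_class : (N,) 1 if all four images of a pair share a class, else 0
    """
    loss_sketch = contrastive_loss(sk1, sk2, same_class, margin)  # within the sketch domain
    loss_view   = contrastive_loss(v1, v2, same_class, margin)    # within the view domain
    loss_cross  = contrastive_loss(sk1, v1, same_class, margin)   # ties the two domains together
    return loss_sketch + loss_view + loss_cross
```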

    3.2 Retrieval

Learned feature vectors are of length 64 for each sketch and view image. Retrieval is done using K-nearest neighbors (KNN) with Euclidean distance as the metric.
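A minimal retrieval sketch, assuming the query feature and the gallery of rendered-view features have already been extracted by the network (function and array names are ours):

```python
import numpy as np

def retrieve(query_feat, gallery_feats, k=3):
    """Return indices of the k gallery features nearest to the query (Euclidean distance)."""
    dists = np.linalg.norm(gallery_feats - query_feat, axis=1)  # distance to every gallery item
    return np.argsort(dists)[:k]                                # indices of the k closest items
```

For sketch-view retrieval, query_feat would be a 64-d sketch feature and gallery_feats an (M, 64) matrix of view features; the roles are swapped for view-sketch retrieval.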

    3.3 Rendered views

We randomly chose three views for the 3D models. We did not choose the most representative view for each 3D model; instead we offer these three views and believe that the chance of all three views being degenerate is low.


  • Figure 2: Illustration and Architecture of our DeepSketch Siamese CNN.

Figure 3: (LEFT) examples of rendered views from 3D models. (RIGHT) example model and sketch from the SHREC'13 dataset.

All 3D models are rendered from the same three randomly chosen views. Some example renderings are shown in Figure 3.
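The view selection can be as simple as fixing a random seed and drawing three camera angles once; the snippet below is only a sketch of that idea, and the angle ranges are our assumptions rather than the values used for the renderings.

```python
import numpy as np

rng = np.random.default_rng(0)               # fixed seed: every model gets the same three views
azimuths = rng.uniform(0, 360, size=3)       # degrees around the vertical axis
elevations = rng.uniform(-30, 60, size=3)    # degrees above/below the horizon
views = list(zip(azimuths, elevations))      # each 3D model is rendered once per (azimuth, elevation)
```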

    4 Experiments

    4.1 Dataset

The SHREC'13 dataset is a subset of the Princeton Shape Benchmark. It has 1258 3D models and 90 classes. The sketches in each class have 80 instances, split into two sets: 50 for training and 30 for testing. No validation set is used. Note that the number of 3D models in each class varies: the largest class has 184 instances, but there are 23 classes containing no more than 5 3D models. Thus the dataset we are using is not ideal and certainly has bias.

    4.2 Generate data Pairs for DeepSketch

To make sure there are enough similar and dissimilar pairs of data, for each sketch we generate 2 positive pairs and 10 negative pairs. Each data pair contains 4 images: 2 sketches and 2 views. A positive data pair contains images all from the same class, while a negative data pair contains images from different classes. Positive data pairs are labeled 1 and negative pairs are labeled 0. We did not do data augmentation in our experiments. In total we trained on 5445 data pairs and tested on 2700 data pairs. We show some of our positive and negative data pairs in Figure 4.
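The pair generation could look roughly like the sketch below. This is our illustrative version: the helper name, the dictionary layout, and the exact way classes are mixed inside a negative pair are assumptions.

```python
import random

def make_pairs(sketches_by_class, views_by_class, pos_per_sketch=2, neg_per_sketch=10):
    """Build (sketch, sketch, view, view, label) training tuples from per-class image lists."""
    pairs, classes = [], list(sketches_by_class)
    for cls in classes:
        for sk in sketches_by_class[cls]:
            for _ in range(pos_per_sketch):        # positive pair: all four images share one class
                pairs.append((sk, random.choice(sketches_by_class[cls]),
                              random.choice(views_by_class[cls]),
                              random.choice(views_by_class[cls]), 1))
            for _ in range(neg_per_sketch):        # negative pair: the rest drawn from another class
                other = random.choice([c for c in classes if c != cls])
                pairs.append((sk, random.choice(sketches_by_class[other]),
                              random.choice(views_by_class[other]),
                              random.choice(views_by_class[other]), 0))
    return pairs
```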


• Figure 4: examples of input training pairs. (LEFT) positive example input pair; (RIGHT) negative example input pair.

    4.3 Network Settings

    4.4 Evaluation

We did sketch-sketch retrieval, sketch-view retrieval, as well as view-sketch retrieval. All experiments are evaluated on top-1 and top-3 retrieval accuracy.
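Top-k accuracy can be computed directly from the KNN results. The sketch below is our own evaluation helper, not from the released code; it counts a query as correct if any of its k nearest gallery items shares the query's class.

```python
import numpy as np

def topk_accuracy(query_feats, query_labels, gallery_feats, gallery_labels, k=3):
    """Fraction of queries whose k nearest gallery items contain the query's class."""
    hits = 0
    for feat, label in zip(query_feats, query_labels):
        dists = np.linalg.norm(gallery_feats - feat, axis=1)  # distance to every gallery item
        nearest = np.argsort(dists)[:k]                       # indices of the k closest items
        hits += int(label in gallery_labels[nearest])
    return hits / len(query_labels)
```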

    5 Results

    5.1 Retrieval Accuracy

We evaluated the retrieval accuracy across the sketch and view domains. More specifically, given a sketch or a rendered view, we retrieve the nearest (top 1) or the three nearest (top 3) images from the search domain. Chance accuracy is 1.1% since we have 90 classes in total. Overall retrieval accuracies are reasonably good given the bias and small size of the dataset we use. We observe that view-sketch retrieval outperforms sketch-view retrieval. We believe this is caused by the small variation among 3D models of a class and the large variation among sketch images of a class.

          sketch-sketch   sketch-view   view-sketch
  Top 1       0.383          0.326         0.564
  Top 3       0.523          0.457         0.792

Table 1: Retrieval accuracy across the sketch and view domains.

For sketch-view retrieval, we show the top 5 classes with the highest retrieval accuracies in Table 2. For view-sketch retrieval, we list the top 5 classes with the highest retrieval accuracies in Table 3.

          wineglass   wine-bottle   chair   hot_air_balloon   pig
  Top 1      0.80         0.77       0.73        0.73         0.73
  Top 3      0.83         0.83       0.77        0.77         0.73

Table 2: Top 5 classes with the highest sketch-view retrieval accuracy.

We can observe some classes overlapping with those obtained from sketch-view retrieval. The learned features in the sketch and view domains are consistent in this respect.

  beer-mug   duck   hammer   hand   hot_air_balloon

Table 3: Top 5 classes with the highest view-sketch retrieval accuracy.

Below we show some positive and negative retrieval results for sketch-view and view-sketch retrieval.


• Figure 5: examples of sketch-view retrieval results. (LEFT) positive retrieval result; (RIGHT) negative retrieval result.

    Figure 6: examples of view-sketch retrieval results. (LEFT) positive retrieval result; (RIGHT) negative retrieval result.

    5.2 Feature Visualization

We visualized the learned feature vectors using T-SNE, in the sketch space and the view space separately, as well as in the combined space.
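A minimal visualization sketch using scikit-learn's T-SNE; the feature file names are hypothetical placeholders for the 64-d features exported from the trained network.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

feats = np.load("sketch_features.npy")    # (N, 64) learned sketch features (hypothetical file)
labels = np.load("sketch_labels.npy")     # (N,) class ids (hypothetical file)

emb = TSNE(n_components=2, perplexity=30, init="random", random_state=0).fit_transform(feats)
plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=5, cmap="tab20")
plt.title("T-SNE of learned sketch features")
plt.show()
```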

There are fewer clusters in the T-SNE plot of the view space than of the sketch space. One reason could be the data itself, in that the number of 3D models per class is unbalanced.

We think it is also interesting to look at a combined space for both sketch and view features. In Figure 9, we run T-SNE on all the sketch and view features together and plot the two domains separately.

Notice that in the combined T-SNE plot, certain classes map to the same position in the sketch and view domains, shown in the red circled regions. But there is also a big cluster in the sketch domain that does not map to the view domain, circled in green.

    5.3 Stroke Analysis

Inspired by [6], we analyzed individual strokes in the sketch domain and found that different strokes weigh differently in the retrieval results.

We did the stroke analysis in two scenarios: sketches that originally perform well, and sketches that originally perform poorly. For both cases, we removed one or several strokes from the original sketch and ran inference again on the 'modified' sketch to observe how the retrieval results change (shown in Figure 10).


  • Figure 7: T-SNE visualization of sketch domain

We plot an importance map for those strokes: green strokes play a positive role in retrieval, blue strokes are neutral, and red strokes are negative, meaning that removing them leads to better retrieval results.
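A sketch of that leave-one-stroke-out procedure is below. It is purely illustrative: rasterize, extract_feature, and retrieve_label are hypothetical helpers standing in for the stroke rendering step, the sketch branch of the trained network, and top-1 KNN retrieval.

```python
def stroke_importance(strokes, true_label, rasterize, extract_feature, retrieve_label):
    """Score each stroke by how top-1 retrieval changes when that stroke is removed."""
    base_correct = retrieve_label(extract_feature(rasterize(strokes))) == true_label
    scores = []
    for i in range(len(strokes)):
        reduced = strokes[:i] + strokes[i + 1:]     # drop stroke i, keep the rest
        correct = retrieve_label(extract_feature(rasterize(reduced))) == true_label
        if correct == base_correct:
            scores.append(0)                        # neutral stroke (blue)
        elif base_correct and not correct:
            scores.append(+1)                       # removing it hurts: positive stroke (green)
        else:
            scores.append(-1)                       # removing it helps: negative stroke (red)
    return scores
```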

    5.4 Feature clustering

We also did K-means clustering within the learned sketch feature space. Note that we did not provide any view information when training our network, but we found that certain view information is learned, as shown in Figure 11.
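A minimal clustering sketch using scikit-learn, assuming the 64-d features of one class have been exported to a file; the file name and the number of clusters are our assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

feats = np.load("chair_sketch_features.npy")                 # (N, 64) features of one class (hypothetical)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(feats)
print(np.bincount(kmeans.labels_))                           # cluster sizes; inspecting members reveals rough views
```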

    6 Discussion

A common question is how well the system performs if we train sketch and view images together as a classification task. We tried using AlexNet trained on all images for a 90-class classification. However, the accuracy stays around 10% for a long time and training does not converge. Although this is still higher than chance, the result is not satisfying. We think the gap between the sketch feature space and the view feature space (as shown in the combined T-SNE visualization in Figure 9) can partly explain the low accuracy of training all images in a single network.


  • Figure 8: T-SNE visualization of view domain

Another concern is the bias in the SHREC'13 dataset. It is still small compared to current large-scale image datasets such as ImageNet. The unbalanced number of 3D models per class is also undesirable, especially since there are even some duplicated models. The rendering algorithm we used to generate view images from the 3D models also affects our learned feature space.

But the overall retrieval results are reasonably good given the dataset we use. Our DeepSketch network also demonstrates its effectiveness in learning feature representations of both the sketch and view spaces for the shape retrieval task.

    7 Conclusion

In this work, we presented a system to retrieve 3D shape models from a dataset given a single 2D sketch as the query. We demonstrated that with the help of Siamese CNNs, we are able to learn a feature space for both the sketch and view domains and retrieve accurate 3D shape models from hand-drawn 2D sketches.

It would also be interesting to explore methods that combine multiple input query methods, such as a combination of sketches and keywords.

We have put our code as well as the Caffe protobuf files and trained models online at ceciliavision/sketchRetrieval.


  • Figure 9: T-SNE visualization of combined sketch and view domains

    Figure 10: Individual stroke analysis


  • Figure 11: K-means clustering in sketch feature space


  • References

[1] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 539–546. IEEE, 2005.

[2] M. Eitz, R. Richter, T. Boubekeur, K. Hildebrand, and M. Alexa. Sketch-based shape retrieval. ACM Trans. Graph., 31(4), 2012.

[3] T. Funkhouser, P. Min, M. Kazhdan, J. Chen, A. Halderman, D. Dobkin, and D. Jacobs. A search engine for 3D models. ACM Transactions on Graphics (TOG), 22(1):83–105, 2003.

[4] B. Li, Y. Lu, A. Godil, T. Schreck, M. Aono, H. Johan, J. M. Saavedra, and S. Tashiro. SHREC'13 track: large scale sketch-based 3D shape retrieval. In Proceedings of the Sixth Eurographics Workshop on 3D Object Retrieval, pages 89–96. Eurographics Association, 2013.

[5] J. Loffler. Content-based retrieval of 3D models in distributed web databases by visual shape information. In Information Visualization, 2000. Proceedings. IEEE International Conference on, pages 82–87. IEEE, 2000.

[6] R. G. Schneider and T. Tuytelaars. Sketch classification and classification-driven analysis using Fisher vectors.

[7] P. Shilane, P. Min, M. Kazhdan, and T. Funkhouser. The Princeton Shape Benchmark. In Shape Modeling Applications, 2004. Proceedings, pages 167–178. IEEE, 2004.

[8] F. Wang, L. Kang, and Y. Li. Sketch-based 3D shape retrieval using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1875–1883, 2015.

