Turning Mobile Phones into 3D Scanners

Kalin Kolev, Petri Tanskanen, Pablo Speciale and Marc Pollefeys

Department of Computer Science, ETH Zurich, Switzerland

Abstract

In this paper, we propose an efficient and accurate scheme for the integration of multiple stereo-based depth measurements. For each provided depth map a confidence-based weight is assigned to each depth estimate by evaluating local geometry orientation, underlying camera setting and photometric evidence. Subsequently, all hypotheses are fused together into a compact and consistent 3D model. Thereby, visibility conflicts are identified and resolved, and fitting measurements are averaged with regard to their confidence scores. The individual stages of the proposed approach are validated by comparing it to two alternative techniques which rely on a conceptually different fusion scheme and a different confidence inference, respectively. Pursuing live 3D reconstruction on mobile devices as a primary goal, we demonstrate that the developed method can easily be integrated into a system for monocular interactive 3D modeling, substantially improving its accuracy while adding a negligible overhead to its performance and retaining its interactive potential.

1. Introduction

There is a growing demand for easy and reliable generation of 3D models of real-world objects and environments. Vision-based techniques offer a promising, accessible alternative to active laser scanning technologies with competitive quality. While the acquisition of photographs is trivial and does not require expertise, the generation of an image set which ensures the desired accuracy of the subsequently obtained 3D model is a more challenging task. Camera sensor noise, occlusions and complex reflectance of the scene often lead to failure in the reconstruction process, but their appearance is difficult to predict in advance. This problem is addressed by monocular real-time capable systems which can provide useful feedback to the user in the course of the reconstruction process and assist him in planning his movements. Interactive systems based on video cameras [6] and depth sensors [14, 7] have been demonstrated.

Figure 1. This paper deals with the problem of live 3D reconstruction on mobile phones. The proposed approach allows 3D models of pleasing quality to be obtained interactively and entirely on-device.

Unfortunately, their usability is limited to desktop computers and high-end laptops as they rely on massive processing resources like multi-core CPUs and powerful GPUs. This not only precludes casual capture of 3D models but also reduces the user's benefit from the provided visual feedback, since his attention must constantly be redirected from the capturing device to the display and back.

Modern smartphones and tablet computers offer improved mobility and interactivity, and open up new possibilities for live 3D modeling. While recent mobile devices are equipped with substantial computational power like multi-core processors and graphics processing cores, their capabilities are still far from those of desktop computers. To a great extent, these restrictions render most of the currently known approaches inapplicable on mobile devices, giving room to research in the direction of specially designed, efficient on-line algorithms that tackle the limitations of embedded hardware architectures. While first notable attempts at interactive 3D reconstruction on smartphones have already been presented [8, 13, 20], an application able to produce high-quality 3D models of real-world objects and environments is still elusive.

This paper can be regarded as an effort towards closing the gap between the capabilities of current systems for live 3D reconstruction on mobile devices and the accuracy of similar interactive systems designed for high-end hardware (see Fig. 1). Its main contribution is the development of an efficient and accurate scheme for integrating multiple stereo-based depth hypotheses into a compact and consistent 3D model. Thereby, various criteria based on local geometry orientation, underlying camera setting and photometric evidence are evaluated to judge the reliability of each measurement. Based on that, the proposed fusion technique assesses the integrity of the depth estimates and resolves visibility conflicts. We demonstrate the performance of the developed method within a framework for real-time 3D reconstruction on a mobile phone and show that the accuracy of the system can be improved while retaining its interactive rate.

2. Related Work

As the current paper deals with the problem of depth map fusion, which is a classical problem in multi-view 3D reconstruction, it is related to a myriad of works on binocular and multi-view stereo. We refer to the benchmarks in [16], [17] and [18] for a representative list. However, most of those methods are not applicable to our particular scenario as they are not incremental in nature or do not meet the efficiency requirements of embedded systems. In the following, we will focus only on approaches which are conceptually related to ours.

Building upon pioneering work on reconstruction with a hand-held camera [10], Pollefeys et al. [11] presented a complete pipeline for real-time video-based 3D acquisition. The system was developed with a focus on capturing large-scale urban scenes by means of multiple video cameras mounted on a driving vehicle. Yet, despite its real-time performance, the applicability of the system in a live scenario is not straightforward. Nevertheless, we drew some inspiration from the utilized depth map fusion scheme, originally published in [4]. The first methods for real-time interactive 3D reconstruction were proposed by Newcombe et al. [5] and Stuehmer et al. [19]. In both works, a 3D representation of the scene is obtained by estimating depth maps from multiple views and converting them to triangle meshes based on the respective neighborhood connectivity. Even though these techniques cover our context, they are designed for high-end computers and are not functional on mobile devices due to some time-consuming optimization operations. Another approach for live video-based 3D reconstruction, which is conceptually similar to ours, was proposed by Vogiatzis and Hernandez [21]. Here, the captured scene is represented by a point cloud where each generated 3D point is obtained as a probabilistic depth estimate by fusing measurements from different views. Similar to the already discussed methods, this one also requires substantial computational resources. Another key difference to our framework is the utilization of a marker to estimate camera poses, which entails considerable limitations in terms of usability.

Recently, the work of Pradeep et al. [12] appeared. It presents another pipeline for real-time 3D reconstruction from monocular video input based on volumetric depth-map fusion. Again, those techniques are developed for high-end computers and have never been demonstrated on embedded systems.

Probably the most similar method to ours was proposed in [22] and subsequently generalized in [3, 1]. Therein, a system for interactive in-hand scanning of objects was demonstrated. Similar to the approach presented in this paper, it relies on a surfel representation of the modeled 3D object. However, the developed fusion scheme is designed for measurements stemming from active sensors, which are considerably more accurate than stereo-based ones. Therefore, the employed confidence estimation is quite different from the one proposed in the current paper.

Recently, the first works on live 3D reconstruction on mobile devices appeared. Wendel et al. [23] rely on a distributed framework with a variant of [2] on a micro air vehicle. A tablet computer is merely used for visualization, while all demanding computations are performed on a separate server machine. Sankar et al. [15] proposed a system for interactively creating and navigating through visual tours. Thereby, an approximate geometry of indoor environments is generated based on strong planar priors and some user interaction. Pan et al. [8] demonstrated an automatic system for 3D reconstruction capable of operating entirely on a mobile phone. However, the generated 3D models are not very precise due to the sparse nature of the approach. Prisacariu et al. [13] presented a shape-from-silhouette framework running in real time on a mobile phone. Despite the impressive performance, the method suffers from the known weaknesses of silhouette-based techniques, e.g. the inability to capture concavities. Tanskanen et al. [20] developed a dense stereo-based system for 3D reconstruction capable of interactive rates on a mobile phone. We use a similar system as a starting point and show that considerable accuracy improvements can be achieved by integrating the proposed approach without affecting its interactive potential.

3. Multi-Resolution Depth Map Computation

In the first stage of the 3D modeling pipeline, depth maps are created from a set of keyframes and corresponding calibration information and camera poses. Here, we adopt the methodology proposed in [20]. Apart from being efficient and accurate, it is particularly appealing due to the potential of the utilized multi-resolution depth map computation scheme for implementation on mobile GPUs. In the following, we outline the procedure for the sake of completeness. More details can be found in [20].

A camera motion tracking system produces a series of keyframes and associated camera poses which are provided to a dense modeling module. As abrupt jumps in the camera motion cannot be expected, a straightforward strategy is to maintain a sliding window containing the most recent keyframes and use them for stereo matching, but also to check consistency between different depth maps. Pursuing an interactive framework on mobile devices, binocular stereo instead of multi-view stereo is applied to minimize the memory access overhead. In particular, a newly arrived keyframe is used as a reference image and is matched to an appropriate image in the current buffer. Thereby, a multi-resolution scheme for the depth map computation is employed to reduce the computational time and to avoid local maxima of the photoconsistency score along the considered epipolar segments. When moving from one resolution level to the next, the depth range is restricted based on the depth estimates at neighboring pixels. Additionally, computations are limited to pixels exhibiting sufficient local image texturedness within regions where the current 3D model has not reached the desired degree of maturity. The result is a depth map possibly corrupted by noise due to motion blur, occlusions, lack of texture, presence of slanted surfaces etc. A very efficient and effective filtering procedure is applied to remove the outliers: the consistency of each depth measurement is tested for agreement with the other depth maps within the sliding window. If a sufficient number of confirmations is reached, the measurement is retained; otherwise it is discarded as an outlier. Subsequently, the depth map is smoothed by applying bilateral filtering to improve the precision of the depth values.
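The sliding-window consistency test can be sketched as follows. This is a simplified illustration only: the function name, the helper structure of poses and the values of min_confirm and rel_tol are assumptions and not the paper's actual settings.

```python
# Hedged sketch: keep a depth value only if enough other depth maps in the
# sliding window re-observe it at a consistent depth.
import numpy as np

def filter_depth_map(depth_ref, K, pose_ref, others, min_confirm=2, rel_tol=0.02):
    """depth_ref : (H, W) reference depth map (0 = invalid).
    K         : (3, 3) camera intrinsics.
    pose_ref  : (R, t) mapping world points into the reference camera frame.
    others    : list of (depth_map, (R, t)) for the other keyframes in the window."""
    H, W = depth_ref.shape
    R_ref, t_ref = pose_ref
    us, vs = np.meshgrid(np.arange(W), np.arange(H))
    valid = depth_ref > 0

    # Back-project all reference pixels to world coordinates.
    rays = np.linalg.inv(K) @ np.stack([us, vs, np.ones_like(us)], 0).reshape(3, -1)
    X_cam = rays * depth_ref.reshape(1, -1)
    X_world = R_ref.T @ (X_cam - t_ref.reshape(3, 1))

    confirmations = np.zeros(H * W, dtype=int)
    for depth_k, (R_k, t_k) in others:
        Y = R_k @ X_world + t_k.reshape(3, 1)     # points in the other camera frame
        proj = K @ Y
        z = Y[2]
        front = z > 1e-9
        u = np.zeros(H * W, dtype=int)
        v = np.zeros(H * W, dtype=int)
        u[front] = np.round(proj[0, front] / z[front]).astype(int)
        v[front] = np.round(proj[1, front] / z[front]).astype(int)
        inside = front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
        d_obs = np.zeros(H * W)
        d_obs[inside] = depth_k[v[inside], u[inside]]
        # A view confirms the measurement if its depth agrees within a relative tolerance.
        agree = inside & (d_obs > 0) & (np.abs(d_obs - z) < rel_tol * z)
        confirmations += agree.astype(int)

    keep = valid.reshape(-1) & (confirmations >= min_confirm)
    return np.where(keep.reshape(H, W), depth_ref, 0.0)
```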

The final output of this stage is a series of partial depth maps. We build upon this scheme and additionally compute a normal vector for each depth measurement by applying a local plane fitting procedure. Isolated points with insufficient support within the neighborhood are discarded. In the next stage, all those measurements are merged into a unified 3D model of the scene.
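One possible realization of this plane-fit step is sketched below; the window size, minimum-support threshold and function name are illustrative assumptions rather than the paper's implementation.

```python
# Hedged sketch of the local plane fit used to attach a normal to each depth measurement.
import numpy as np

def estimate_normals(points, valid, cam_center, window=5, min_support=10):
    """points : (H, W, 3) back-projected 3D points, valid : (H, W) bool mask."""
    H, W, _ = points.shape
    r = window // 2
    normals = np.zeros_like(points)
    keep = np.zeros(valid.shape, dtype=bool)
    for y in range(r, H - r):
        for x in range(r, W - r):
            if not valid[y, x]:
                continue
            patch = points[y - r:y + r + 1, x - r:x + r + 1]
            mask = valid[y - r:y + r + 1, x - r:x + r + 1]
            nbrs = patch[mask]
            if nbrs.shape[0] < min_support:       # isolated point -> discard
                continue
            centered = nbrs - nbrs.mean(axis=0)
            # Plane normal = direction of least variance of the local neighborhood.
            _, _, vt = np.linalg.svd(centered, full_matrices=False)
            n = vt[-1]
            # Orient the normal towards the camera center.
            if np.dot(n, cam_center - points[y, x]) < 0:
                n = -n
            normals[y, x] = n
            keep[y, x] = True
    return normals, keep
```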

4. Confidence-Based Depth Map Fusion

A central issue in the design of a depth map fusion approach is the representation of the modeled scene. While triangle meshes constitute a common geometric representation, they do not seem well-suited for interactive applications running in real time, since considerable efforts are needed to guarantee the integrity and consistency of the mesh topology after adding, updating or removing any vertices. Note that the user is expected to make use of the live visual feedback and recapture certain parts of the scene until the desired surface quality is reached. For that reason, we rely on a surfel representation [9]. A surfel s_j consists of a position p_j, a normal vector N_j, a color C_j and a confidence score c_j, which is defined as the difference between a cumulative inlier and outlier weight, i.e. c_j = W_j^(in) − W_j^(out). Additional attributes like local patch radius or visibility information could be maintained if needed. The utilized surfel representation offers the required resilience since the unstructured set of surfels can easily be kept consistent throughout any modifications.
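A minimal sketch of such a surfel record is given below; the field names are illustrative, and an on-device implementation would presumably store packed arrays rather than Python objects.

```python
# Hedged sketch of a surfel record with the attributes listed above.
from dataclasses import dataclass
import numpy as np

@dataclass
class Surfel:
    position: np.ndarray          # p_j, 3D position
    normal: np.ndarray            # N_j, unit normal
    color: np.ndarray             # C_j, RGB color
    w_in: float = 0.0             # cumulative inlier weight  W_j^(in)
    w_out: float = 0.0            # cumulative outlier weight W_j^(out)

    @property
    def confidence(self) -> float:
        # c_j = W_j^(in) - W_j^(out)
        return self.w_in - self.w_out
```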

The proposed depth map fusion approach relies on the following scheme: when a new depth map becomes available, a weight is assigned to each pixel measurement reflecting its expected accuracy. Based on this input, the surfel model is modified by adding new surfels, and updating or removing existing ones. In the following, these steps are explained in more detail.

4.1. Confidence-Based Weighting

The accuracy of a depth measurement obtained from stereo matching depends on many factors, e.g. inherent scene texture, geometry orientation, camera noise, distance between the scene and the camera device etc. In an effort to capture all those aspects, we assign different weights to each estimate and combine them subsequently to obtain a final weighting score that expresses our confidence in the particular depth value.

Geometry-Based Weights. The accuracy of a depth measurement depends on the local surface orientation at that point. The depth measurement is more accurate when the observed geometry is fronto-parallel and less accurate at grazing viewing angles. As a local normal vector is computed for each depth estimate, those cases can be identified by considering the scalar product between the normal and the respective viewing direction of the camera. If n_x ∈ S^2 denotes the normal vector and v_x ∈ S^2 stands for the normalized reverted viewing direction of the camera for a pixel x ∈ Ω ⊂ Z^2 within the image domain, we define the geometry-based weight of x as

w_g(x) = (⟨n_x, v_x⟩ − cos(α_max)) / (1 − cos(α_max)),  if ∠(n_x, v_x) ≤ α_max,
w_g(x) = 0,  otherwise,   (1)

where α_max is a critical angle at which the measurements are considered unreliable and is set to 80° throughout all experiments. The weight defined in (1) takes on values within [0, 1]. Note that it does not directly depend on the depth estimates. However, there is an indirect relation as the computation of the normal vectors relies on them.
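Eq. (1) can be written down directly; the sketch below uses the stated α_max = 80° and assumes unit-length input vectors.

```python
# Geometry-based weight of Eq. (1).
import numpy as np

def geometry_weight(normal, view_dir, alpha_max_deg=80.0):
    """normal, view_dir: unit vectors; view_dir points from the surface towards the camera."""
    cos_max = np.cos(np.radians(alpha_max_deg))
    c = float(np.dot(normal, view_dir))       # cosine of the angle between n_x and v_x
    if c < cos_max:                           # angle larger than alpha_max -> unreliable
        return 0.0
    return (c - cos_max) / (1.0 - cos_max)    # in [0, 1], 1 when fronto-parallel
```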

Camera-Based Weights. The accuracy of a depth measurement obtained from binocular stereo depends on the utilized camera setting. For example, short baselines implicate high depth imprecision, as larger changes of the depth along the visual rays result in small projection footprints on the image plane of the non-reference camera. Analogously, increasing the image resolution or moving the camera closer to the scene leads to more accurate depth estimates. Based on these observations, a camera-based weight could be defined by measuring the depth deviation corresponding to a certain shift (for example one pixel) along the respective epipolar line. Yet, this cannot be realized efficiently since it involves an additional triangulation operation. Further complications are posed by the discrepancy between viewing ray traversal and pixel sampling. Instead, we revert the inference and measure the pixel shift δ that a certain offset along the ray produces. More concretely, the offset along the visual rays is set to 1/600 of the depth range. Then, the camera-based weight of a pixel x is defined as

w_c(x) = 1 − e^(−λδ),   (2)

where λ ∈ R is a parameter specifying the penalizing behavior of the term and is set to 5.0 throughout all experiments, and δ is measured in pixel coordinates. Note that w_c ∈ [0, 1] is inversely related to the estimated depths, i.e. larger depths get lower weights and smaller depths get higher weights. This corresponds to the intuition that parts of the scene closer to the camera are expected to be reconstructed more accurately than parts further away from the camera. Moreover, the length of the baseline is also taken into account by the formulation in (2). In particular, depth maps obtained from short baselines will generally be weighted lower.
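A sketch of Eq. (2) follows: the pixel shift δ is obtained by projecting the 3D point at the estimated depth and at the depth plus the fixed offset into the second view. The signature (intrinsics K2 and relative pose (R_rel, t_rel) of the second camera) is an illustrative assumption.

```python
# Camera-based weight of Eq. (2); offset = 1/600 of the depth range, lambda = 5
# as stated in the text.
import numpy as np

def camera_weight(ray_ref, depth, depth_range, K2, R_rel, t_rel, lam=5.0):
    """ray_ref : unit viewing ray of the pixel in the reference camera frame.
    (R_rel, t_rel) maps reference-camera coordinates into the second camera."""
    offset = depth_range / 600.0

    def project(X):
        Y = K2 @ (R_rel @ X + t_rel)
        return Y[:2] / Y[2]

    p0 = project(ray_ref * depth)
    p1 = project(ray_ref * (depth + offset))
    delta = np.linalg.norm(p1 - p0)           # pixel shift in the second image
    return 1.0 - np.exp(-lam * delta)         # in [0, 1]
```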

Photoconsistency-Based Weights. Probably the most straightforward criterion to judge the accuracy of a depth measurement is its photoconsistency score. However, this is also the least discriminative criterion since the provided depth maps are already checked for consistency and filtered; thus, the respective matching scores are expected to be high. The easiest way to obtain the photoconsistency value for a depth estimate is to use the one delivered by the stereo module. Yet, as normal information is available at that point, a more accurate measure can be employed. Here, we adopt normalized cross-correlation (NCC) over 5 × 5 patches, where the provided normal vectors are leveraged to warp the patches from the reference image to the second view. Then, for a pixel x we specify

w_ph(x) = NCC(x),  if NCC(x) ≥ thr,
w_ph(x) = 0,  otherwise,   (3)

as the photoconsistency-based weight. Thereby, thr is a threshold parameter set to 0.65 throughout all experiments, and NCC(x) denotes the NCC score for the depth and the normal at x. Again, we have w_ph ∈ [0, 1]. It should be noted that the computation of the photoconsistency-based weights is more time-consuming than that of the geometry-based and the camera-based ones while having the least contribution to the final weighting values. For this reason, it could be omitted when more efficiency is required.
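The following sketch evaluates Eq. (3). For brevity it compares axis-aligned 5 × 5 patches and omits the normal-guided warp described above, so it is only an approximation of the measure used in the paper; thr = 0.65 follows the text.

```python
# Simplified photoconsistency weight of Eq. (3) using plain NCC over 5x5 patches.
import numpy as np

def ncc(patch_a, patch_b, eps=1e-6):
    a = patch_a - patch_a.mean()
    b = patch_b - patch_b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum()) + eps
    return float((a * b).sum() / denom)

def photo_weight(img_ref, img_sec, x_ref, x_sec, thr=0.65, half=2):
    """x_ref, x_sec: (u, v) pixel positions in the reference and second image."""
    (ur, vr), (us, vs) = x_ref, x_sec
    pa = img_ref[vr - half:vr + half + 1, ur - half:ur + half + 1]
    pb = img_sec[vs - half:vs + half + 1, us - half:us + half + 1]
    score = ncc(pa, pb)
    return score if score >= thr else 0.0
```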

Figure 2. Confidence-based weighting of depth measurements. The reference image of a stereo pair and corresponding color-coded weights for the computed depth estimates. Green represents high weighting, red represents low weighting. Note that pixels where the local normal vector points away from the camera get small weights. Also, more distant measurements tend to be weighted low.

The last step is to combine all weight estimates and to provide a final overall weight to each depth measurement in the provided depth map. To this end, for each x we set

w(x) = w_g(x) · w_c(x) · w_ph(x).   (4)

The overall weight lies in [0, 1] and will be high only when all three weights, the geometry-based one, the camera-based one and the photoconsistency-based one, are high. In other words, a measurement is considered accurate if it is accurate from a geometric, stereoscopic and photometric point of view.
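As a one-line sketch of Eq. (4), with the option mentioned above of dropping the photoconsistency term when efficiency matters:

```python
# Final per-measurement weight, Eq. (4): the product of the three terms.
def overall_weight(w_g, w_c, w_ph=1.0):
    # w_ph can be left at 1.0 when the photoconsistency term is skipped for speed.
    return w_g * w_c * w_ph
```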

Fig. 2 shows an example of the estimated weighting for a depth map capturing a small church figurine. For all depth measurements the corresponding weights are computed according to (4). Note that the effects of applying the geometry and the camera terms are clearly visible. Indeed, pixels where the local normal vector points away from the camera get small weights. Also, more distant measurements tend to be weighted low. The effect of applying the photoconsistency term is less noticeable.

4.2. Measurement Integration

When a new depth map becomes available and confidence weights are assigned to all measurements, the provided data is used to update the current surfel cloud. This is done using three basic operations: surfel addition, surfel update and surfel removal. New surfels are created for parts of the depth map that are not explained by the current model. Surfels that are in correspondence with the input depth map are updated by integrating the respective depth and normal estimates. Surfels with a confidence value below a certain threshold are removed from the cloud. In the following, these operations are explained in more detail.

Figure 3. Different cases for a surfel update. Red denotes the incoming measurement and dark red the surfel. (a) The measurement is in front of the observed surfel; there is no visibility conflict. (b) The measurement is behind the observed surfel; there is a visibility conflict. (c) The measurement and the observed surfel match. (d) The depths of the measurement and the observed surfel match, but not their normals; there is a visibility conflict. See text for more details.

Surfel addition. Surfels are added in those parts where the depth map is not covered by model surfels. Of course, for the initial depth map all measurements will create new surfels. For each newly created surfel the position and normal vector are set according to the depth and normal estimate of the measurement. The color is set to the color of the respective image pixel. The cumulative inlier weight is initialized with the weight of the depth measurement and the cumulative outlier weight with zero.
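A sketch of surfel creation, reusing the Surfel record sketched in Section 4; the function and argument names are illustrative.

```python
# Hedged sketch: create a surfel for a depth measurement not covered by the model.
def add_surfel(cloud, position, normal, pixel_color, weight):
    s = Surfel(position=position, normal=normal, color=pixel_color,
               w_in=weight,    # inlier weight starts at the measurement's weight
               w_out=0.0)      # outlier weight starts at zero
    cloud.append(s)
    return s
```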

Surfel update. If the projection of a surfel coincides with a provided depth measurement, the surfel is updated. Let s_j = (p_j, N_j, C_j, W_j^(in), W_j^(out), c_j) be the surfel of interest. If there are multiple surfels along the same visual ray, we take the one closest to the camera center that is expected to be visible. Additionally, we maintain a state vector X_j = (p_1, p_2, p_3, θ, φ) ∈ R^5 encoding its current position and normal. Thereby, the normal is represented by means of a polar angle θ and an azimuth angle φ. When a new surfel is created, a spherical coordinate system is generated with the provided normal estimate as the first base vector. Let x = Π(p_j) be the projection of the surfel onto the image plane of the current frame and let d(p_j) be its depth with respect to the camera center. At x the given depth map provides a depth measurement d_x and a normal measurement n_x. In addition to that, we get a weight w(x) reflecting the accuracy of the estimates.

Now, we have to update the surfel based on this input. There are four different update cases (see Fig. 3):

(1) d(p_j) > d_x (by more than the matching tolerance of case (3)): The depth measurement occludes the model surfel. By itself this is not a visibility conflict since the depth map could capture a different part of the surface. The dashed line in Fig. 3(a) shows a potential visibility configuration. In fact, this is the most delicate case as both the surfel and the measurement could be outliers. Here, we just ignore the depth measurement and do not perform any surfel update. Note that this could cause problems when parts of the surface are acquired which are in the line of sight of already reconstructed ones (with the same orientation). However, this is unlikely to occur in practice as the user usually captures more accessible parts first before moving to locations that are more difficult to reach.

(2) d(p_j) < d_x (by more than the matching tolerance of case (3)): The depth measurement is behind the model surfel. This is a clear visibility conflict. In this case we add the measurement's weight to the cumulative outlier weight of the surfel, i.e.

W_j^(out) ← W_j^(out) + w(x).   (5)

(3) |d(p_j) − d_x| / d(p_j) < ε and ∠(N_j, n_x) ≤ 45°: The measurement and the model surfel match, both in terms of depth and normal orientation. Then, the surfel position and normal are updated accordingly. In particular, we compute a running weighted average

X_j ← (W_j^(in) · X_j + w(x) · X_x) / (W_j^(in) + w(x)),
W_j^(in) ← W_j^(in) + w(x),   (6)

where the pixel's depth d_x and normal n_x are converted into a state vector X_x.

(4) |d(p_j) − d_x| / d(p_j) < ε and ∠(N_j, n_x) > 45°: The measurement and the model surfel match in terms of depth, but the orientations of their normals deviate from each other. We consider this a visibility conflict and increment the cumulative outlier weight according to (5).

Recall that there are two additional attributes of each surfel: a color C_j and a confidence score c_j. The color is set to the color of the pixel with the largest weight w(x) used in the fusion process for the surfel. The confidence measure is defined as the difference between the cumulative inlier weight and the cumulative outlier weight, i.e. c_j = W_j^(in) − W_j^(out), and has to be updated each time one of those values is modified.

Surfel removal. Surfels are removed from the cloud during the acquisition process if their confidence falls below a threshold. We set this threshold to −0.5 throughout all conducted experiments. Note that the removal of surfels opens up gaps that can be filled by new, more accurate surfels.
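The per-measurement update and removal logic above can be sketched as follows. For simplicity the running average of Eq. (6) is taken directly over the position and the re-normalized normal rather than over the 5-D state vector used in the paper, the relative depth tolerance ε is an illustrative value, and the color bookkeeping is omitted.

```python
# Hedged sketch of the four update cases of Sec. 4.2 and of surfel removal.
import numpy as np

def update_surfel(s, d_surfel, d_meas, p_meas, n_meas, weight, eps=0.01):
    """s: Surfel; d_surfel, d_meas: depths along the ray; returns False if the measurement is ignored."""
    rel = (d_surfel - d_meas) / d_surfel
    angle_ok = np.dot(s.normal, n_meas) >= np.cos(np.radians(45.0))

    if rel > eps:
        # Case (1): measurement in front of the surfel -> no conflict, ignore it.
        return False
    if rel < -eps or not angle_ok:
        # Cases (2) and (4): visibility conflict -> accumulate outlier weight, Eq. (5).
        s.w_out += weight
        return True
    # Case (3): depths and normals match -> running weighted average, Eq. (6).
    total = s.w_in + weight
    s.position = (s.w_in * s.position + weight * p_meas) / total
    n = s.w_in * s.normal + weight * n_meas
    s.normal = n / np.linalg.norm(n)
    s.w_in = total
    return True

def prune(cloud, threshold=-0.5):
    # Surfel removal: drop surfels whose confidence c_j falls below the threshold.
    return [s for s in cloud if s.confidence >= threshold]
```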


Figure 4. Confidence evolution during reconstruction. Visualized are the color-coded confidence scores of the generated surfels for consecutive frames of a real-world sequence. Green represents high confidence, red represents low confidence. An input image from the same viewpoint can be seen in Fig. 2. Note how the confidence values of surfels seen from different directions increase in the course of reconstruction.

One could wonder why the normals are integrated in the proposed depth map fusion scheme. In fact, they could be obtained in a post-processing step by considering the neighborhood of each point within the point cloud. There are two main reasons for this design decision. First, the normal information is useful as it captures the local geometric structure of each depth measurement and enables the identification of accidental matches like the case depicted in Fig. 3(d). Second, the proposed scheme allows us to leverage the neighborhood relation between different measurements, provided by the camera sensor. Moreover, note that the proposed depth map fusion procedure is incremental and lends itself to online applications. Also, it allows reconstructed parts of the scene to be recaptured by providing additional depth data and improving the accuracy of the respective subset of the surfel cloud.

Fig. 4 depicts the evolution of the confidence scores of the generated surfels for consecutive frames of a real-world sequence. Note that the confidence values are small for newly created surfels but increase in the course of the acquisition process if they are observed from other viewpoints.

5. Experimental Results

We validate the proposed confidence-based depth map fusion scheme by comparing it to two state-of-the-art real-time capable alternatives. Furthermore, we demonstrate its performance by integrating it into a system for live 3D reconstruction running on a mobile phone.

5.1. Comparison to Alternative Techniques

For the sake of comparison, we implemented two alternative techniques meeting the efficiency requirements of the application at hand.

The first one is the merging method used in [20]. Thereby, the interconnection between the different input depth maps is exploited merely to identify inconsistencies and to filter out outliers. All consistent depth measurements are back-projected to 3D and merged into a unified point cloud. Moreover, a coverage mask based on photometric criteria is estimated in each step to reduce the generation of redundant points. See [20] for more details.

To evaluate the viability of the confidence-based weighting approach, we combined the developed fusion scheme with the weight computation proposed in [4]. The basic idea of this strategy is to judge the accuracy of each depth measurement by analyzing the photoconsistency distribution along the respective visual rays. Rays with a single sharp maximum are expected to provide more accurate estimates than those exhibiting a shallow maximum or several local maxima. More details can be found in [4].

Fig. 5 shows the reconstructions generated by applying all three techniques on a real-world image sequence. One of the input images can be seen in Fig. 2. Camera poses were obtained by applying a version of [2]. Note that the approach of [20] does not explicitly estimate normals for the generated point cloud. Therefore, for the purpose of rendering, we assigned to each point a normal vector based on the depth map that was used to create it. For the other two approaches we used the normal estimates obtained online from the fusion process. It is evident that while all three methods achieve a high degree of completeness, the proposed one with confidence-based weighting outperforms the others in terms of accuracy. The technique in [20] produces an oversampling of the scene and is more sensitive to noise than the other two, as each 3D point is based on a single depth measurement. This underlines the importance of a depth map fusion scheme. Moreover, the reconstruction obtained with the proposed confidence-based weighting is significantly more accurate than the one relying on the weighting of [4], which validates the deployment of geometric and camera-based criteria in the depth integration process.

5.2. Live 3D Reconstruction on a Mobile Phone

Pursuing a system for live 3D reconstruction running on mobile phones as a primary goal, we integrated the proposed method into the framework of [20]. This substantially improved its accuracy while adding a negligible overhead of less than a second per processed image.


Figure 5. Comparison to alternative techniques. From left to right: reconstructions with the depth map merging technique in [20], the developed fusion scheme with the weighting suggested in [4], and the complete approach proposed in this paper. One of the images in the input sequence can be seen in Fig. 2. The reconstructions contain 311135, 161647 and 181077 points, respectively. While all three methods achieve a high degree of completeness, the proposed approach with confidence-based weighting outperforms the other two in terms of accuracy.

Figure 6. Hippopotamus. Rendering of the reconstructed surfel cloud with colors and shading, and a reference image of the object. Note the accurate reconstruction of the head.

Figure 7. Relief. Rendering of the reconstructed surfel cloud with colors and shading, and a reference image of the object. The model was captured outdoors.

In the following, multiple reconstructions of real-world objects, generated interactively on a Samsung Galaxy SIII and a Samsung Galaxy Note 3, are depicted.

Fig. 6 depicts the reconstruction of a fabric toy of a hippopotamus. Expectedly, homogeneous regions (e.g. on the ball) lead to holes in the 3D model. However, the well-textured head of the hippopotamus is reconstructed at high geometric precision.

Figure 8. Buddha statue. Rendering of the reconstructed surfel cloud with colors and shading, and a reference image of the object. Note the accurately captured small-scale details.

Fig. 7 shows the reconstruction of a relief on a decoration vase. The model was captured outdoors under sunlight conditions. Note that this is a known failure case for many active sensors.

The capabilities of current mobile devices for in-hand scanning are further demonstrated in Fig. 8, which visualizes the reconstruction of a Buddha statue in a museum. Even though the generated point cloud exhibits a substantial amount of high-frequency noise, many small-scale details like the wrinkles of the clothing or the facial features are captured in the reconstruction.

6. Conclusion

We presented an efficient and accurate method for confidence-based depth map fusion. At its core is a two-stage approach where confidence-based weights that reflect the expected accuracy are first assigned to each depth measurement, and the measurements are subsequently integrated into a unified and consistent 3D model. Thereby, the maintained 3D representation in the form of a surfel cloud is updated dynamically so as to resolve visibility conflicts and ensure the integrity of the reconstruction. The advantages of the proposed approach in terms of accuracy improvements are highlighted by a comparison to alternative techniques which meet the underlying efficiency requirements. Additionally, the potential of the developed method is emphasized by integrating it into a state-of-the-art system for live 3D reconstruction running on a mobile phone and demonstrating its performance on multiple real-world objects.

Acknowledgments

We thank Lorenz Meier for helping with the supplementary material. This work is funded by the ETH Zurich Postdoctoral Fellowship Program, the Marie Curie Actions for People COFUND Program and ERC grant no. 210806.

References

[1] M. Keller, D. Lefloch, M. Lambers, S. Izadi, T. Weyrich, and A. Kolb. Real-time 3D reconstruction in dynamic scenes using point-based fusion. In 3DV, pages 1–8, 2013.
[2] G. Klein and D. Murray. Parallel tracking and mapping on a camera phone. In ISMAR, pages 83–86, 2009.
[3] M. Krainin, P. Henry, X. Ren, and D. Fox. Manipulator and object tracking for in-hand 3D object modeling. Int. J. Rob. Res., 30(11):1311–1327, 2011.
[4] P. Merrell, A. Akbarzadeh, L. Wang, P. Mordohai, J.-M. Frahm, R. Yang, D. Nistér, and M. Pollefeys. Real-time visibility-based fusion of depth maps. In IEEE International Conference on Computer Vision (ICCV), pages 1–8, 2007.
[5] R. A. Newcombe and A. J. Davison. Live dense reconstruction with a single moving camera. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
[6] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison. DTAM: Dense tracking and mapping in real-time. In IEEE International Conference on Computer Vision (ICCV), pages 2320–2327, 2011.
[7] R. A. Newcombe et al. KinectFusion: Real-time dense surface mapping and tracking. In ISMAR, pages 127–136, 2011.
[8] Q. Pan, C. Arth, E. Rosten, G. Reitmayr, and T. Drummond. Rapid scene reconstruction on mobile phones from panoramic images. In ISMAR, pages 55–64, 2011.
[9] H. Pfister, M. Zwicker, J. van Baar, and M. Gross. Surfels: Surface elements as rendering primitives. In SIGGRAPH, pages 335–342, 2000.
[10] M. Pollefeys, L. Van Gool, M. Vergauwen, F. Verbiest, K. Cornelis, J. Tops, and R. Koch. Visual modeling with a hand-held camera. Int. J. Comput. Vision, 59(3):207–232, 2004.
[11] M. Pollefeys et al. Detailed real-time urban 3D reconstruction from video. Int. J. Comput. Vision, 78(2-3):143–167, 2008.
[12] V. Pradeep, C. Rhemann, S. Izadi, C. Zach, M. Bleyer, and S. Bathiche. MonoFusion: Real-time 3D reconstruction of small scenes with a single web camera. In ISMAR, pages 83–88, 2013.
[13] V. A. Prisacariu, O. Kaehler, D. Murray, and I. Reid. Simultaneous 3D tracking and reconstruction on a mobile phone. In ISMAR, 2013.
[14] S. Rusinkiewicz, O. Hall-Holt, and M. Levoy. Real-time 3D model acquisition. In SIGGRAPH, pages 438–446, New York, NY, USA, 2002. ACM.
[15] A. Sankar and S. Seitz. Capturing indoor scenes with smartphones. In ACM Symposium on User Interface Software and Technology, 2012.
[16] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. J. Comput. Vision, 47(1-3):7–42, 2002.
[17] S. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski. A comparison and evaluation of multi-view stereo reconstruction algorithms. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 519–528, 2006.
[18] C. Strecha, W. von Hansen, L. V. Gool, P. Fua, and U. Thoennessen. On benchmarking camera calibration and multi-view stereo for high resolution imagery. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Anchorage, AK, USA, 2008.
[19] J. Stuehmer, S. Gumhold, and D. Cremers. Real-time dense geometry from a handheld camera. In Pattern Recognition (Proc. DAGM), pages 11–20, 2010.
[20] P. Tanskanen, K. Kolev, L. Meier, F. Camposeco, O. Saurer, and M. Pollefeys. Live metric 3D reconstruction on mobile phones. In IEEE International Conference on Computer Vision (ICCV), 2013.
[21] G. Vogiatzis and C. Hernandez. Video-based, real-time multi-view stereo. Image Vision Comput., pages 434–441, 2011.
[22] T. Weise, T. Wismer, B. Leibe, and L. V. Gool. In-hand scanning with online loop closure. In IEEE International Workshop on 3-D Digital Imaging and Modeling, 2009.
[23] A. Wendel, M. Maurer, G. Graber, T. Pock, and H. Bischof. Dense reconstruction on-the-fly. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1450–1457, 2012.