
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis

Ben Mildenhall1* Pratul P. Srinivasan1* Matthew Tancik1*

Jonathan T. Barron2 Ravi Ramamoorthi3 Ren Ng1

1UC Berkeley 2Google Research 3UC San Diego

Abstract. We present a method that achieves state-of-the-art results for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views. Our algorithm represents a scene using a fully-connected (non-convolutional) deep network, whose input is a single continuous 5D coordinate (spatial location (x, y, z) and viewing direction (θ, φ)) and whose output is the volume density and view-dependent emitted radiance at that spatial location. We synthesize views by querying 5D coordinates along camera rays and use classic volume rendering techniques to project the output colors and densities into an image. Because volume rendering is naturally differentiable, the only input required to optimize our representation is a set of images with known camera poses. We describe how to effectively optimize neural radiance fields to render photorealistic novel views of scenes with complicated geometry and appearance, and demonstrate results that outperform prior work on neural rendering and view synthesis. View synthesis results are best viewed as videos, so we urge readers to view our supplementary video for convincing comparisons.

Keywords: scene representation, view synthesis, image-based rendering, volume rendering, 3D deep learning

1 Introduction

In this work, we address the long-standing problem of view synthesis in a new way by directly optimizing parameters of a continuous 5D scene representation to minimize the error of rendering a set of captured images. We represent a scene as a continuous 5D function that outputs the radiance emitted in each direction (θ, φ) at each point (x, y, z) in space, and a density at each point which acts like a differential opacity controlling how much radiance is accumulated by a ray passing through (x, y, z). Our method optimizes a deep fully-connected neural network without any convolutional layers (often referred to as a multilayer perceptron or MLP) to represent this function by regressing from a single 5D coordinate (x, y, z, θ, φ) to a single volume density and view-dependent RGB color.

* Authors contributed equally to this work.

arXiv:2003.08934v1 [cs.CV] 19 Mar 2020


Input Images → Optimize NeRF → Render new views

Fig. 1: We present a method that optimizes a continuous 5D neural radiance field representation (volume density and view-dependent color at any continuous location) of a scene from a set of input images. We use techniques from volume rendering to accumulate samples of this scene representation along rays to render the scene from any viewpoint. Here, we visualize the set of 100 input views of the synthetic Drums scene randomly captured on a surrounding hemisphere, and we show two novel views rendered from our optimized NeRF representation.

To render this neural radiance field (NeRF) from a particular viewpoint we: 1) march camera rays through the scene to generate a sampled set of 3D points, 2) use those points and their corresponding 2D viewing directions as input to the neural network to produce an output set of colors and densities, and 3) use classical volume rendering techniques to accumulate those colors and densities into a 2D image. Because this process is naturally differentiable, we can use gradient descent to optimize this model to represent a complex scene by minimizing the error between each observed image and the corresponding views rendered from our representation. Minimizing this error across multiple views encourages the network to predict a coherent model of the scene by assigning high volume densities and accurate colors to the locations that contain the true underlying scene content. Figure 2 visualizes this overall pipeline.

We find that the basic implementation of optimizing a neural radiance field representation for a complex scene does not converge to a sufficiently high-resolution representation and is inefficient in the required number of samples per camera ray. We address these issues by transforming input 5D coordinates with a positional encoding that enables the MLP to represent higher frequency functions, and we propose a hierarchical sampling procedure to reduce the number of queries required to adequately sample this high-frequency scene representation.

Our approach inherits the benefits of volumetric representations: both can represent complex real-world geometry and appearance and are well suited for gradient-based optimization using projected images. Crucially, our method is designed to overcome the prohibitive storage costs of discretized voxel grids when modeling complex scenes at high resolutions.

In summary, our key technical contributions are:

– An approach for representing continuous scenes with complex geometry and materials as 5D neural radiance fields, parameterized as basic MLP networks.

– A differentiable rendering procedure based on classical volume rendering techniques, which we use to optimize these representations from standard RGB images. This includes a hierarchical sampling strategy to allocate the MLP's capacity towards space with visible scene content.


– A positional encoding to map each input 5D coordinate into a higher dimensional space, which enables us to successfully optimize neural radiance fields to represent high-frequency scene content.

We demonstrate that our resulting neural radiance field method quantitatively and qualitatively outperforms state-of-the-art view synthesis methods, including works that fit neural 3D representations to scenes as well as works that train deep convolutional networks to predict sampled volumetric representations. As far as we know, this paper presents the first continuous neural scene representation that is able to render high-resolution photorealistic novel views of real objects and scenes from RGB images captured in natural settings.

2 Related Work

A promising recent direction in computer vision is encoding objects and scenes in the weights of an MLP that directly maps from a 3D spatial location to an implicit representation of the shape, such as the signed distance [5] at that location. However, these methods have so far been unable to reproduce realistic scenes with complex geometry with the same fidelity as techniques that represent scenes using discrete representations such as triangle meshes or voxel grids. In this section, we review these two lines of work and contrast them with our approach, which enhances the capabilities of neural scene representations to produce state-of-the-art results for rendering complex realistic scenes.

Neural 3D shape representations. Recent work has investigated the implicit representation of continuous 3D shapes as level sets by optimizing deep networks that map xyz coordinates to a signed distance function [23] or to an occupancy field [19]. However, these models are limited by their requirement of access to ground truth 3D geometry, typically obtained from synthetic 3D shape datasets such as ShapeNet [3]. Subsequent work has relaxed this requirement of ground truth 3D shapes by formulating differentiable rendering functions that allow neural implicit shape representations to be optimized using only 2D images. Niemeyer et al. [21] represent surfaces as 3D occupancy fields and use a numerical method to find the surface intersection for each ray, then calculate an exact derivative using implicit differentiation. Each ray intersection location is then provided as the input to a neural 3D texture field that predicts a diffuse color for that point. Sitzmann et al. [30] use a less direct neural 3D representation that simply outputs a feature vector and RGB color at each continuous 3D coordinate, and propose a differentiable rendering function consisting of a recurrent neural network that marches along each ray to decide where the surface is located. Though these techniques can potentially represent arbitrarily complicated and high-resolution scene geometries, they have so far been limited to simple shapes with low geometric complexity, resulting in oversmoothed rendered views. We show that an alternate strategy of optimizing networks to encode 5D radiance fields (3D volumes with 2D view-dependent appearance) can represent higher-resolution geometry and appearance to render photorealistic novel views of complex scenes.


View synthesis and image-based rendering. The computer graphics community has made significant progress in photorealistic novel view synthesis by predicting traditional geometry and appearance representations from observed images. One popular class of approaches uses mesh-based representations of scenes with either diffuse [34] or view-dependent [2,6,35] appearance. Differentiable rasterizers [4,8,15,17] or pathtracers [14,22] can directly optimize mesh representations to reproduce a set of input images using gradient descent. However, gradient-based mesh optimization based on image reprojection is often difficult, likely because of local minima or poor conditioning of the loss landscape. Furthermore, this strategy requires a template mesh with fixed topology to be provided as an initialization before optimization [14], which is typically unavailable for unconstrained real-world scenes.

Another class of methods use volumetric representations to specifically address the task of high-quality photorealistic view synthesis from a set of input RGB images. Volumetric approaches are able to realistically represent complex shapes and materials, are well-suited for gradient-based optimization, and tend to produce less visually distracting artifacts than mesh-based methods. Early volumetric approaches used observed images to directly color voxel grids [12,28,32]. More recently, several methods [7,20,24,31,37] have used large datasets of multiple scenes to train deep networks that predict a sampled volumetric representation from a set of input images, and then use alpha-compositing [25] along rays to render novel views at test time. Other works have optimized a combination of convolutional networks (CNNs) and sampled voxel grids for each specific scene, such that the CNN can compensate for discretization artifacts from low resolution voxel grids [29] or allow the predicted voxel grids to vary based on input time or animation controls [16]. While these volumetric techniques have achieved impressive results for novel view synthesis, their ability to scale to higher resolution imagery is fundamentally limited by poor time and space complexity due to their discrete sampling: rendering higher resolution images requires a finer sampling of 3D space. We circumvent this problem by instead encoding a continuous volume within the parameters of a deep fully-connected neural network, which not only produces significantly higher quality renderings than prior volumetric approaches, but also requires just a fraction of the storage cost of those sampled volumetric representations.

3 Neural Radiance Field Scene Representation

We represent a continuous scene as a 5D vector-valued function whose input is a 3D location x = (x, y, z) and 2D viewing direction (θ, φ), and whose output is an emitted color c = (r, g, b) and volume density σ. In practice, we express direction as a 3D Cartesian unit vector d. We approximate this continuous 5D scene representation with an MLP network FΘ : (x, d) → (c, σ) and optimize its weights Θ to map each input 5D coordinate to its corresponding volume density and directional emitted color.
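For concreteness, a viewing direction given as (θ, φ) can be converted to the Cartesian unit vector d with a few lines of NumPy; the sketch below assumes the common convention of θ as the polar angle from the +z axis and φ as the azimuth, which the text does not specify, and the function name is ours.

import numpy as np

def direction_to_unit_vector(theta, phi):
    # Assumed convention: theta is the polar angle from the +z axis, phi the azimuth
    # in the xy-plane; the paper only states that directions are expressed as unit vectors.
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])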



Fig. 2: An overview of our neural radiance field scene representation and differentiable rendering procedure. We synthesize images by sampling 5D coordinates (location and viewing direction) along camera rays (a), feeding those locations into an MLP to produce a color and volume density (b), and using volume rendering techniques to composite these values into an image (c). This rendering function is differentiable, so we can optimize our scene representation by minimizing the residual between synthesized and ground truth observed images (d).

We encourage the representation to be multiview consistent by restricting the network to predict the volume density σ as a function of only the location x, while allowing the RGB color c to be predicted as a function of both location and viewing direction. To accomplish this, the MLP FΘ first processes the input 3D coordinate x with 8 fully-connected layers (using ReLU activations and 256 channels per layer), and outputs σ and a 256-dimensional feature vector. This feature vector is then concatenated with the camera ray's viewing direction and passed to 4 additional fully-connected layers (using ReLU activations and 128 channels per layer) that output the view-dependent RGB color.
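As an illustration of the layer sizes described above, here is a minimal NumPy sketch of FΘ with randomly initialized weights. The function names, the initialization scheme, and the sigmoid on the RGB output are our assumptions; positional encoding of the inputs (Sec. 5.1) is omitted.

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def init_layer(n_in, n_out, rng):
    # Placeholder initialization; the text does not specify an init scheme.
    return rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_in, n_out)), np.zeros(n_out)

def build_nerf_mlp(dim_x=3, dim_d=3, seed=0):
    """Sizes follow Sec. 3: 8 layers of 256 channels, then 4 layers of 128 channels."""
    rng = np.random.default_rng(seed)
    widths = [dim_x] + [256] * 8
    params = {
        "trunk": [init_layer(widths[i], widths[i + 1], rng) for i in range(8)],
        "sigma": init_layer(256, 1, rng),        # volume density head (depends on x only)
        "feature": init_layer(256, 256, rng),    # 256-dimensional feature vector
        "rgb": init_layer(128, 3, rng),
    }
    head_widths = [256 + dim_d] + [128] * 4
    params["color"] = [init_layer(head_widths[i], head_widths[i + 1], rng) for i in range(4)]
    return params

def nerf_mlp(params, x, d):
    """x: (N, 3) positions, d: (N, 3) unit view directions -> (rgb, sigma)."""
    h = x
    for W, b in params["trunk"]:
        h = relu(h @ W + b)
    sigma = relu(h @ params["sigma"][0] + params["sigma"][1])        # density from x only
    feat = h @ params["feature"][0] + params["feature"][1]
    h = np.concatenate([feat, d], axis=-1)                           # append view direction
    for W, b in params["color"]:
        h = relu(h @ W + b)
    logits = h @ params["rgb"][0] + params["rgb"][1]
    rgb = 1.0 / (1.0 + np.exp(-logits))   # sigmoid keeps colors in [0, 1] (our assumption)
    return rgb, sigma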

See Fig. 3 for an example of how our method uses the input viewing direction to represent non-Lambertian effects. As shown in Fig. 4, a model trained without view dependence (only x as input) has difficulty representing specularities.

4 Volume Rendering with Radiance Fields

Our 5D neural radiance field represents a scene as the volume density and directional emitted radiance at any point in space. We render the color of any ray passing through the scene using principles from classical volume rendering [10]. The volume density σ(x) can be interpreted as the differential probability of a ray terminating at an infinitesimal particle at location x. The expected color C(r) of camera ray r(t) = o + td with near and far bounds tn and tf is:

C(r) = ∫_{tn}^{tf} T(t) σ(r(t)) c(r(t), d) dt ,   where   T(t) = exp( −∫_{tn}^{t} σ(r(s)) ds ) .     (1)

The function T(t) denotes the accumulated transmittance along the ray from tn to t, i.e., the probability that the ray travels from tn to t without hitting any other particle.


(a) View 1 (b) View 2 (c) Radiance Distributions

Fig. 3: A visualization of view-dependent emitted radiance. Our neural radiance field representation outputs RGB color as a 5D function of both spatial position x and viewing direction d. Here, we visualize example directional color distributions for two spatial locations in our neural representation of the Ship scene. In (a) and (b), we show the appearance of two fixed 3D points from two different camera positions: one on the side of the ship (orange insets) and one on the surface of the water (blue insets). Our method predicts the changing specular appearance of these two 3D points, and in (c) we show how this behavior generalizes continuously across the whole hemisphere of viewing directions.

Rendering a view from our continuous neural radiance field requires estimating this integral C(r) for a camera ray traced through each pixel of the desired virtual camera.

We numerically estimate this continuous integral using quadrature. Deterministic quadrature, which is typically used for rendering discretized voxel grids, would effectively limit our representation's resolution because the MLP would only be queried at a fixed discrete set of locations. Instead, we use a stratified sampling approach where we partition [tn, tf] into N evenly-spaced bins and then draw one sample uniformly at random from within each bin:

ti ∼ U[ tn + ((i−1)/N)(tf − tn) ,  tn + (i/N)(tf − tn) ] .     (2)

Although we use a discrete set of samples to estimate the integral, stratified sampling enables us to represent a continuous scene because it results in the MLP being evaluated at continuous positions over the course of optimization. We use these samples to estimate Ĉ(r) with the quadrature rule discussed in the volume rendering review by Max [18]:

Ĉ(r) = Σ_{i=1}^{N} Ti (1 − exp(−σi δi)) ci ,   where   Ti = exp( −Σ_{j=1}^{i−1} σj δj ) ,     (3)

where δi = ti+1 − ti is the distance between adjacent samples. This function for calculating Ĉ(r) from the set of (ci, σi) values is trivially differentiable and reduces to traditional alpha compositing with alpha values αi = 1 − exp(−σi δi).
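A minimal NumPy sketch of Eqns. 2 and 3 under the definitions above; the function names are ours, and any radiance field returning (ci, σi) at the sampled points (such as the MLP of Sec. 3) can supply the inputs. The infinite distance assigned after the last sample is an implementation convention, not something stated in the text.

import numpy as np

def stratified_samples(t_near, t_far, n_samples, rng=np.random.default_rng(0)):
    """Eqn. 2: draw one uniform sample from each of N evenly spaced bins in [t_near, t_far]."""
    edges = np.linspace(t_near, t_far, n_samples + 1)
    lower, upper = edges[:-1], edges[1:]
    return lower + (upper - lower) * rng.uniform(size=n_samples)

def composite_ray(rgb, sigma, t_vals):
    """Eqn. 3: alpha-composite per-sample colors and densities into one ray color.

    rgb: (N, 3), sigma: (N,), t_vals: (N,) increasing sample distances along the ray.
    Returns the estimated color and the per-sample weights (used again in Sec. 5.2).
    """
    delta = np.concatenate([t_vals[1:] - t_vals[:-1], [1e10]])   # delta_i = t_{i+1} - t_i
    alpha = 1.0 - np.exp(-sigma * delta)                         # alpha_i
    # T_i = exp(-sum_{j<i} sigma_j delta_j) = exclusive cumulative product of (1 - alpha_j)
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    weights = trans * alpha
    return (weights[:, None] * rgb).sum(axis=0), weights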


Ground Truth Complete Model No View Dependence No Positional Encoding

Fig. 4: Here we visualize how our full model benefits from representing view-dependent emitted radiance and from passing our input coordinates through a high-frequency positional encoding. Removing view dependence prevents the model from recreating the specular reflection on the bulldozer tread. Removing the positional encoding drastically decreases the model's ability to represent high frequency geometry and texture, resulting in an oversmoothed appearance.

5 Optimizing a Neural Radiance Field

In the previous section we have described the core components necessary for modeling a scene as a neural radiance field and rendering novel views from this representation. However, we observe that these components are not sufficient for achieving state-of-the-art quality, as demonstrated in Section 6.4. We introduce two improvements to enable representing high-resolution complex scenes. The first is a positional encoding of the input coordinates that assists the MLP in representing high-frequency functions, and the second is a hierarchical sampling procedure that allows us to efficiently sample this high-frequency representation.

5.1 Positional encoding

Despite the fact that neural networks are universal function approximators [9], we found that having the network FΘ directly operate on xyzθφ input coordinates results in renderings that perform poorly at representing high-frequency variation in color and geometry. This is consistent with recent work by Rahaman et al. [26], which shows that deep networks are biased towards learning lower frequency functions. They additionally show that mapping the inputs to a higher dimensional space using high frequency functions before passing them to the network enables better fitting of data that contains high frequency variation.

We leverage these findings in the context of neural scene representations, and show that reformulating FΘ as a composition of two functions FΘ = F′Θ ∘ γ, one learned and one not, significantly improves performance (see Fig. 4 and Table 2). Here γ is a mapping from R into a higher dimensional space R^{2L}, and F′Θ is still simply a regular MLP. Formally, the encoding function we use is:

γ(p) = ( sin(2^0 π p), cos(2^0 π p), … , sin(2^{L−1} π p), cos(2^{L−1} π p) ) .     (4)

This function γ(·) is applied separately to each of the three coordinate values in x (which are normalized to lie in [−1, 1]) and to the three components of the Cartesian viewing direction unit vector d (which by construction lie in [−1, 1]).


In our experiments, we set L = 10 for γ(x) and L = 4 for γ(d).
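A short sketch of Eqn. 4 (the function name is ours): each scalar coordinate is mapped to 2L sinusoids, so γ(x) has 60 dimensions with L = 10 and γ(d) has 24 dimensions with L = 4.

import numpy as np

def positional_encoding(p, L):
    """Eqn. 4 applied componentwise: p has shape (..., D); the result has shape (..., 2*L*D)."""
    freqs = (2.0 ** np.arange(L)) * np.pi          # 2^0 pi, 2^1 pi, ..., 2^{L-1} pi
    angles = p[..., None] * freqs                  # (..., D, L)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*p.shape[:-1], -1)

gamma_x = positional_encoding(np.zeros((1, 3)), L=10)   # shape (1, 60)
gamma_d = positional_encoding(np.zeros((1, 3)), L=4)    # shape (1, 24)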

A similar mapping is used in the popular Transformer architecture [33], where it is referred to as a positional encoding. However, Transformers use it for a different goal of providing the discrete positions of tokens in a sequence as input to an architecture that does not contain any notion of order. In contrast, we use these functions to map continuous input coordinates into a higher dimensional space to enable our MLP to more easily approximate a higher frequency function.

5.2 Hierarchical volume sampling

Our rendering strategy of densely evaluating the neural radiance field network at N query points along each camera ray is inefficient: free space and occluded regions that do not contribute to the rendered image are still sampled repeatedly. We draw inspiration from early work in volume rendering [13] and propose a hierarchical representation that increases rendering efficiency by allocating samples proportionally to their expected effect on the final rendering.

Instead of just using a single network to represent the scene, we simultaneously optimize two networks: one "coarse" and one "fine". We first sample a set of Nc locations using stratified sampling, and evaluate the "coarse" network at these locations as described in Eqns. 2 and 3. Given the output of this "coarse" network, we then produce a more informed sampling of points along each ray where samples are biased towards the relevant parts of the volume. To do this, we first rewrite the alpha composited color from the coarse network Ĉc(r) in Eqn. 3 as a weighted sum of all sampled colors ci along the ray:

Ĉc(r) = Σ_{i=1}^{Nc} wi ci ,   wi = Ti (1 − exp(−σi δi)) .     (5)

Normalizing these weights as ŵi = wi / Σ_{j=1}^{Nc} wj produces a piecewise-constant PDF along the ray. We sample a second set of Nf locations from this distribution using inverse transform sampling, evaluate our "fine" network at the union of the first and second set of samples, and compute the final rendered color of the ray Ĉf(r) using Eqn. 3 but using all Nc + Nf samples. This procedure allocates more samples to regions we expect to contain visible content. This addresses a similar goal as importance sampling, but we use the sampled values as a nonuniform discretization of the whole integration domain rather than treating each sample as an independent probabilistic estimate of the entire integral.
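The sketch below illustrates this second sampling stage: the coarse weights wi from Eqn. 5 define a piecewise-constant PDF whose CDF is inverted at uniform random values. Treating the coarse sample locations as bin edges and weighting each bin by the average of its endpoint weights are our assumptions; the text does not spell out these binning details, and the function name is ours.

import numpy as np

def sample_pdf(t_coarse, weights, n_fine, rng=np.random.default_rng(0)):
    """Inverse transform sampling from the piecewise-constant PDF given by coarse weights.

    t_coarse: (Nc,) increasing coarse sample locations; weights: (Nc,) the w_i of Eqn. 5.
    """
    w = 0.5 * (weights[:-1] + weights[1:]) + 1e-5     # (Nc - 1,) bin weights, avoid all zeros
    pdf = w / w.sum()                                 # normalized piecewise-constant PDF
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])     # (Nc,) CDF, starts at 0 and ends at 1
    u = rng.uniform(size=n_fine)                      # uniform variates to invert
    idx = np.clip(np.searchsorted(cdf, u, side="right") - 1, 0, len(pdf) - 1)
    lo, hi = t_coarse[idx], t_coarse[idx + 1]         # edges of the selected bin
    frac = (u - cdf[idx]) / np.maximum(cdf[idx + 1] - cdf[idx], 1e-10)
    return np.sort(lo + frac * (hi - lo))             # Nf fine sample locations

# The "fine" network is then evaluated at the sorted union of the Nc coarse and Nf fine samples.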

5.3 Implementation details

We optimize a separate neural continuous volume representation network for each scene. This requires only a dataset of captured RGB images of the scene, the corresponding camera poses and intrinsic parameters, and scene bounds (we use ground truth camera poses, intrinsics, and bounds for synthetic data, and use the COLMAP structure-from-motion package [27] to estimate these parameters for real data).


At each optimization iteration, we randomly sample a batch of camera rays from the set of all pixels in the dataset, and then follow the hierarchical sampling described in Sec. 5.2 to query Nc samples from the coarse network and Nc + Nf samples from the fine network. We then use the volume rendering procedure described in Sec. 4 to render the color of each ray from both sets of samples. Our loss is simply the total squared error between the rendered and true pixel colors for both the coarse and fine renderings:

L = Σ_{r∈R} [ ‖Ĉc(r) − C(r)‖₂² + ‖Ĉf(r) − C(r)‖₂² ]     (6)

where R is the set of rays in each batch, and C(r), Ĉc(r), and Ĉf(r) are the ground truth, coarse volume predicted, and fine volume predicted RGB colors for ray r, respectively. Note that even though the final rendering comes from Ĉf(r), we also minimize the loss of Ĉc(r) so that the weight distribution from the coarse network can be used to allocate samples in the fine network.

In our experiments, we use a batch size of 4096 rays, each sampled at Nc = 64 coordinates in the coarse volume and Nf = 128 additional coordinates in the fine volume. We use the Adam optimizer [11] with a learning rate that begins at 5 × 10^{−4} and decays exponentially to 5 × 10^{−5} over the course of optimization (other Adam hyperparameters are left at default values of β1 = 0.9, β2 = 0.999, and ε = 10^{−7}). The optimization for a single scene typically takes around 100–300k iterations to converge on a single NVIDIA V100 GPU (about 1–2 days).
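As a sketch of the objective and schedule described above (our function names; gradient computation and the Adam update itself are omitted), the loss of Eqn. 6 and an exponential learning-rate decay between the stated endpoints might look like:

import numpy as np

def nerf_loss(c_coarse, c_fine, c_gt):
    """Eqn. 6: total squared error of the coarse and fine renderings for a batch of rays.

    All arguments are (batch_size, 3) arrays of RGB colors (batch_size = 4096 in the paper).
    """
    return np.sum((c_coarse - c_gt) ** 2) + np.sum((c_fine - c_gt) ** 2)

def learning_rate(step, total_steps, lr_start=5e-4, lr_end=5e-5):
    """Exponential decay from 5e-4 to 5e-5 over optimization (Sec. 5.3).

    Only the endpoints are given in the text; the exact decay curve is our assumption.
    """
    t = step / total_steps
    return lr_start * (lr_end / lr_start) ** t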

6 Results

We quantitatively (Table 1) and qualitatively (Figs. 5 and 6) show that our method outperforms prior work, and provide extensive ablation studies to validate our design choices (Table 2). We urge the reader to view our supplementary video to better appreciate our method's significant improvement over baseline methods when rendering smooth paths of novel views.

6.1 Datasets

Synthetic renderings of objects. We first show experimental results on two datasets of synthetic renderings of objects (Table 1, "Diffuse Synthetic 360°" and "Realistic Synthetic 360°"). The DeepVoxels [29] dataset contains four Lambertian objects with relatively simple geometry. Each object is rendered at 512×512 pixels from viewpoints sampled on the upper hemisphere (479 as input and 1000 for testing). We additionally generate our own dataset containing pathtraced images of eight objects that exhibit complicated geometry and realistic non-Lambertian materials. Six are rendered from viewpoints sampled on the upper hemisphere, and two are rendered from viewpoints sampled on a full sphere. We render 100 views of each scene as input and 200 for testing, all at 800 × 800 pixels.


              Diffuse Synthetic 360° [29]     Realistic Synthetic 360°       Real Forward-Facing [20]
Method        PSNR↑   SSIM↑   LPIPS↓          PSNR↑   SSIM↑   LPIPS↓         PSNR↑   SSIM↑   LPIPS↓
SRN [30]      33.20   0.986   0.073           22.26   0.867   0.170          22.84   0.866   0.378
NV [16]       29.62   0.946   0.099           26.05   0.944   0.160          -       -       -
LLFF [20]     34.38   0.995   0.048           24.88   0.935   0.114          24.13   0.909   0.212
Ours          40.15   0.998   0.023           31.01   0.977   0.081          26.50   0.935   0.250

Table 1: Our method quantitatively outperforms prior work on datasets of both synthetic and real images. We report PSNR/SSIM (higher is better) and LPIPS [36] (lower is better). The DeepVoxels [29] dataset consists of 4 diffuse objects with simple geometry. Our realistic synthetic dataset consists of pathtraced renderings of 8 geometrically complex objects with complex non-Lambertian materials. The real dataset consists of handheld forward-facing captures of 8 real-world scenes (NV cannot be evaluated on this data because it only reconstructs objects inside a bounded volume). Though LLFF achieves slightly better LPIPS, we urge readers to view our supplementary video where our method achieves better multiview consistency and produces fewer artifacts than all baselines.

Real images of complex scenes. We show results on complex real-world scenes captured with roughly forward-facing images (Table 1, "Real Forward-Facing"). This dataset consists of 8 scenes captured with a handheld cellphone (5 taken from the LLFF paper and 3 that we capture), captured with 20 to 62 images, and we hold out 1/8 of these for the test set. All images are 1008×756 pixels.

6.2 Comparisons

To evaluate our model we compare against current top-performing techniques for view synthesis, detailed below. All methods use the same set of input views to train a separate network for each scene except Local Light Field Fusion [20], which trains a single 3D convolutional network on a large dataset, then uses the same trained network to process input images of new scenes at test time.

Neural Volumes (NV) [16] synthesizes novel views of objects that lie entirely within a bounded volume in front of a distinct background (which must be separately captured without the object of interest). It optimizes a deep 3D convolutional network to predict a discretized RGBα voxel grid with 128³ samples as well as a 3D warp grid with 32³ samples. The algorithm renders novel views by marching camera rays through the warped voxel grid.

Scene Representation Networks (SRN) [30] represent a continuous scene as an opaque surface, implicitly defined by an MLP that maps each (x, y, z) coordinate to a feature vector. They train a recurrent neural network to march along a ray through the scene representation by using the feature vector at any 3D coordinate to predict the next step size along the ray. The feature vector from the final step is decoded into a single color for that point on the surface.



Fig. 5: Comparisons on test-set views for scenes from our new synthetic dataset generated with a physically-based renderer. Our method is able to recover fine details in both geometry and appearance, such as Ship's rigging, Lego's gear and treads, Microphone's shiny stand and mesh grille, and Material's non-Lambertian reflectance. LLFF exhibits banding artifacts on the Microphone stand and Material's object edges and ghosting artifacts in Ship's mast and inside the Lego object. SRN produces blurry and distorted renderings in every case. Neural Volumes cannot capture the details on the Microphone's grille or Lego's gears, and it completely fails to recover the geometry of Ship's rigging.



Fig. 6: Comparisons on test-set views of real world scenes. LLFF is specifically designed for this use case (forward-facing captures of real scenes). Our method is able to represent fine geometry more consistently across rendered views than LLFF, as shown in Fern's leaves and the skeleton ribs and railing in T-rex. Our method also correctly reconstructs partially occluded regions that LLFF struggles to render cleanly, such as the yellow shelves behind the leaves in the bottom Fern crop and green leaves in the background of the bottom Orchid crop. Blending between multiple renderings can also cause repeated edges in LLFF, as seen in the top Orchid crop. SRN captures the low-frequency geometry and color variation in each scene but is unable to reproduce any fine detail.


Note that SRN is a better-performing followup to DeepVoxels [29] by the same authors, which is why we do not include comparisons to DeepVoxels.

Local Light Field Fusion (LLFF) [20] is designed for producing photorealistic novel views of well-sampled forward-facing scenes. It uses a trained 3D convolutional network to directly predict a discretized frustum-sampled RGBα grid (multiplane image or MPI [37]) for each input view, then renders novel views by alpha compositing and blending nearby MPIs into the novel viewpoint.

6.3 Discussion

We thoroughly outperform both baselines that also optimize a separate network per scene (NV and SRN) in all scenarios. Furthermore, we produce qualitatively and quantitatively superior renderings compared to LLFF (across all except one metric) while using only their input images as our entire training set.

The SRN method produces heavily smoothed geometry and texture, and its representational power for view synthesis is limited by selecting only a single depth and color per camera ray. The NV baseline is able to capture reasonably detailed volumetric geometry and appearance, but its use of an underlying explicit 128³ voxel grid prevents it from scaling to represent fine details at high resolutions. LLFF specifically provides a "sampling guideline" to not exceed 64 pixels of disparity between input views, so it frequently fails to estimate correct geometry in the synthetic datasets, which contain up to 400–500 pixels of disparity between views. Additionally, LLFF blends between different scene representations for rendering different views, resulting in perceptually distracting inconsistency, as is apparent in our supplementary video.

The biggest practical tradeoffs between these methods are time versus space. All compared single scene methods take at least 12 hours to train per scene. In contrast, LLFF can process a small input dataset in under 10 minutes. However, LLFF produces a large 3D voxel grid for every input image, resulting in enormous storage requirements (over 15 GB for one "Realistic Synthetic" scene). Our method requires only 5 MB for the network weights (a relative compression of 3000× compared to LLFF), which is even less memory than the input images alone for a single scene from any of our datasets.

6.4 Ablation studies

We validate our algorithm's design choices and parameters with an extensive ablation study in Table 2. We present results on one of our synthetic scenes with complex geometry and non-Lambertian materials (Lego). Row 9 shows our complete model as a point of reference. Row 1 shows a minimalist version of our model without positional encoding (PE), view-dependence (VD), or hierarchical sampling (H). In rows 2–4 we remove these three components one at a time from the full model, observing that positional encoding provides the largest quantitative benefit of these three contributions (row 2), followed by view-dependence (row 3), and then hierarchical sampling (row 4).


                         Input    #Im.   L    (Nc, Nf)     PSNR↑   SSIM↑   LPIPS↓
1) No PE, VD, H          xyz      100    -    (256, - )    26.38   0.938   0.178
2) No Pos. Encoding      xyzθφ    100    -    (64, 128)    27.75   0.955   0.128
3) No View Dependence    xyz      100    10   (64, 128)    29.93   0.980   0.088
4) No Hierarchical       xyzθφ    100    10   (256, - )    31.42   0.985   0.072

5) Far Fewer Images      xyzθφ    25     10   (64, 128)    27.97   0.967   0.081
6) Fewer Images          xyzθφ    50     10   (64, 128)    31.53   0.985   0.055

7) Fewer Frequencies     xyzθφ    100    5    (64, 128)    30.77   0.981   0.071
8) More Frequencies      xyzθφ    100    15   (64, 128)    32.50   0.988   0.050

9) Complete Model        xyzθφ    100    10   (64, 128)    32.54   0.988   0.050

Table 2: An ablation study of our model for the Lego scene from our realistic synthetic dataset. See Sec. 6.4 for detailed descriptions of each ablation.

Rows 5–6 show how our performance decreases as the number of input images is reduced. Note that our method's performance using only 25 input images still exceeds NV, SRN, and LLFF across all metrics when they are provided with 100 images (see supplementary material). In rows 7–8 we validate our choice of the maximum frequency L used in our positional encoding for x (the maximum frequency used for d is scaled proportionally). Only using 5 frequencies reduces performance, but increasing the number of frequencies from 10 to 15 does not improve performance. We believe the benefit of increasing L is limited once 2^L exceeds the maximum frequency present in the sampled input images (roughly 1024 in our data).

7 Conclusion

Our work directly addresses deficiencies of prior work that uses MLPs to represent objects and scenes as continuous functions. We demonstrate that representing scenes as 5D neural radiance fields (an MLP that outputs volume density and view-dependent emitted radiance as a function of 3D location and 2D viewing direction) produces better renderings than the previously-dominant approach of training deep convolutional networks to output discretized voxel representations.

Although we have proposed a hierarchical sampling strategy to make rendering more sample-efficient (for both training and testing), there is still much more progress to be made in investigating techniques to efficiently optimize and render neural radiance fields. Another direction for future work is interpretability: sampled representations such as voxel grids and meshes admit reasoning about the expected quality of rendered views and failure modes, but it is unclear how to analyze these issues when we encode scenes in the weights of a deep neural network. We believe that this work makes progress towards a graphics pipeline based on real world imagery, where complex scenes could be composed of neural radiance fields optimized from images of actual objects and scenes.


Acknowledgements

We thank Kevin Cao, Guowei Frank Yang, and Nithin Raghavan for comments and discussions. This work is partially funded by ONR grant N000141712687. BM is funded by a Hertz Foundation Fellowship and MT is funded by an NSF Graduate Fellowship. Google provided a generous donation of cloud compute credits through the BAIR Commons program. We thank the following Blend Swap users for the models used in our realistic synthetic dataset: gregzaal (ship), 1DInc (chair), bryanajones (drums), Herberhold (ficus), erickfree (hotdog), Heinzelnisse (lego), elbrujodelatribu (materials), and up3d.de (mic).

References

1. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mane, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viegas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., Zheng, X.: TensorFlow: Large-scale machine learning on heterogeneous systems (2015)

2. Buehler, C., Bosse, M., McMillan, L., Gortler, S., Cohen, M.: Unstructured lumigraph rendering. In: SIGGRAPH (2001)

3. Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., et al.: ShapeNet: An information-rich 3D model repository. arXiv:1512.03012 (2015)

4. Chen, W., Gao, J., Ling, H., Smith, E.J., Lehtinen, J., Jacobson, A., Fidler, S.: Learning to predict 3D objects with an interpolation-based differentiable renderer. In: NeurIPS (2019)

5. Curless, B., Levoy, M.: A volumetric method for building complex models from range images. In: SIGGRAPH (1996)

6. Debevec, P., Taylor, C.J., Malik, J.: Modeling and rendering architecture from photographs: A hybrid geometry- and image-based approach. In: SIGGRAPH (1996)

7. Flynn, J., Broxton, M., Debevec, P., DuVall, M., Fyffe, G., Overbeck, R., Snavely, N., Tucker, R.: DeepView: View synthesis with learned gradient descent. In: CVPR (2019)

8. Genova, K., Cole, F., Maschinot, A., Sarna, A., Vlasic, D., Freeman, W.T.: Unsupervised training for 3D morphable model regression. In: CVPR (2018)

9. Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Networks (1989)

10. Kajiya, J.T., Herzen, B.P.V.: Ray tracing volume densities. Computer Graphics (SIGGRAPH) (1984)

11. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: ICLR (2015)

12. Kutulakos, K.N., Seitz, S.M.: A theory of shape by space carving. International Journal of Computer Vision (2000)

13. Levoy, M.: Efficient ray tracing of volume data. ACM Transactions on Graphics (1990)


14. Li, T.M., Aittala, M., Durand, F., Lehtinen, J.: Differentiable Monte Carlo ray tracing through edge sampling. ACM Transactions on Graphics (SIGGRAPH Asia) (2018)

15. Liu, S., Li, T., Chen, W., Li, H.: Soft Rasterizer: A differentiable renderer for image-based 3D reasoning. In: ICCV (2019)

16. Lombardi, S., Simon, T., Saragih, J., Schwartz, G., Lehrmann, A., Sheikh, Y.: Neural Volumes: Learning dynamic renderable volumes from images. ACM Transactions on Graphics (SIGGRAPH) (2019)

17. Loper, M.M., Black, M.J.: OpenDR: An approximate differentiable renderer. In: ECCV (2014)

18. Max, N.: Optical models for direct volume rendering. IEEE Transactions on Visualization and Computer Graphics (1995)

19. Mescheder, L., Oechsle, M., Niemeyer, M., Nowozin, S., Geiger, A.: Occupancy Networks: Learning 3D reconstruction in function space. In: CVPR (2019)

20. Mildenhall, B., Srinivasan, P.P., Ortiz-Cayon, R., Kalantari, N.K., Ramamoorthi, R., Ng, R., Kar, A.: Local Light Field Fusion: Practical view synthesis with prescriptive sampling guidelines. ACM Transactions on Graphics (SIGGRAPH) (2019)

21. Niemeyer, M., Mescheder, L., Oechsle, M., Geiger, A.: Differentiable volumetric rendering: Learning implicit 3D representations without 3D supervision (2019)

22. Nimier-David, M., Vicini, D., Zeltner, T., Jakob, W.: Mitsuba 2: A retargetable forward and inverse renderer. ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia) (2019)

23. Park, J.J., Florence, P., Straub, J., Newcombe, R., Lovegrove, S.: DeepSDF: Learning continuous signed distance functions for shape representation. In: CVPR (2019)

24. Penner, E., Zhang, L.: Soft 3D reconstruction for view synthesis. ACM Transactions on Graphics (SIGGRAPH Asia) (2017)

25. Porter, T., Duff, T.: Compositing digital images. Computer Graphics (SIGGRAPH) (1984)

26. Rahaman, N., Baratin, A., Arpit, D., Draxler, F., Lin, M., Hamprecht, F.A., Bengio, Y., Courville, A.C.: On the spectral bias of neural networks. In: ICML (2018)

27. Schonberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: CVPR (2016)

28. Seitz, S.M., Dyer, C.R.: Photorealistic scene reconstruction by voxel coloring. International Journal of Computer Vision (1999)

29. Sitzmann, V., Thies, J., Heide, F., Nießner, M., Wetzstein, G., Zollhofer, M.: DeepVoxels: Learning persistent 3D feature embeddings. In: CVPR (2019)

30. Sitzmann, V., Zollhoefer, M., Wetzstein, G.: Scene Representation Networks: Continuous 3D-structure-aware neural scene representations. In: NeurIPS (2019)

31. Srinivasan, P.P., Tucker, R., Barron, J.T., Ramamoorthi, R., Ng, R., Snavely, N.: Pushing the boundaries of view extrapolation with multiplane images. In: CVPR (2019)

32. Szeliski, R., Golland, P.: Stereo matching with transparency and matting. In: ICCV (1998)

33. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NeurIPS (2017)

34. Waechter, M., Moehrle, N., Goesele, M.: Let there be color! Large-scale texturing of 3D reconstructions. In: ECCV (2014)

35. Wood, D.N., Azuma, D.I., Aldinger, K., Curless, B., Duchamp, T., Salesin, D.H., Stuetzle, W.: Surface light fields for 3D photography. In: SIGGRAPH (2000)

36. Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: CVPR (2018)


37. Zhou, T., Tucker, R., Flynn, J., Fyffe, G., Snavely, N.: Stereo magnification: Learning view synthesis using multiplane images. ACM Transactions on Graphics (SIGGRAPH) (2018)

A Additional Implementation Details

Network Architecture. Fig. 7 details our simple fully-connected architecture.

Volume Bounds. Our method renders views by querying the neural radiance field representation at continuous 5D coordinates along camera rays. For experiments with synthetic images, we scale the scene so that it lies within a cube of side length 2 centered at the origin, and only query the representation within this bounding volume. Our dataset of real images contains content that can exist anywhere between the closest point and infinity, so we use normalized device coordinates to map the depth range of these points into [−1, 1]. This shifts all the ray origins to the near plane of the scene, maps the perspective rays of the camera to parallel rays in the transformed volume, and uses disparity (inverse depth) instead of metric depth, so all coordinates are now bounded.

Training Details. For real scene data, we regularize our network by adding random Gaussian noise with zero mean and unit variance to the output σ values (before passing them through the ReLU) during optimization, finding that this slightly improves visual performance for rendering novel views. We implement our model in TensorFlow [1].
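For illustration, this regularization amounts to a one-line perturbation of the raw density output before the ReLU, applied only during optimization; the function name is ours.

import numpy as np

def regularized_sigma(raw_sigma, training, rng=np.random.default_rng(0)):
    """Add zero-mean, unit-variance Gaussian noise to raw density values before the ReLU."""
    if training:
        raw_sigma = raw_sigma + rng.normal(0.0, 1.0, size=raw_sigma.shape)
    return np.maximum(raw_sigma, 0.0)   # ReLU keeps densities non-negative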

Rendering Details. To render new views at test time, we sample 64 points per ray through the coarse network and 64 + 128 = 192 points per ray through the fine network, for a total of 256 network queries per ray. Our realistic synthetic dataset requires 640k rays per image, and our real scenes require 762k rays per image, resulting in between 150 and 200 million network queries per rendered image. On an NVIDIA V100, this takes approximately 30 seconds per frame.
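These query counts follow directly from the image resolutions (Sec. 6.1) and the 256 samples per ray; a quick check:

# 800 x 800 synthetic images and 1008 x 756 real images, 64 + 192 = 256 queries per ray
queries_synthetic = 800 * 800 * 256    # = 163,840,000  (~164 million)
queries_real = 1008 * 756 * 256        # = 195,084,288  (~195 million)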

B Additional Baseline Method Details

Neural Volumes (NV) [16]. We use the NV code open-sourced by the authors¹ and follow their procedure for training on a single scene without time dependence.

Scene Representation Networks (SRN) [30]. We use the SRN code open-sourced by the authors² and follow their procedure for training on a single scene.

Local Light Field Fusion (LLFF) [20]. We use the pretrained LLFF model open-sourced by the authors³.

¹ https://github.com/facebookresearch/neuralvolumes
² https://github.com/vsitzmann/scene-representation-networks
³ https://github.com/Fyusion/LLFF


(Fig. 7: diagram of the fully-connected network architecture, with the positional encodings γ(x) and γ(d) as inputs and the volume density σ(x) and RGB color as outputs.)

�<latexit sha1_base64="PHtNjW6na207435B/B6JIWe5ANM=">AAAB7HicbVDLSgNBEOz1GeMr6tHLYBA8hV0R1FvQi8cIbhJIljA7mU3GzGOZmRXCkn/w4kHFqx/kzb9xkuxBEwsaiqpuurvilDNjff/bW1ldW9/YLG2Vt3d29/YrB4dNozJNaEgUV7odY0M5kzS0zHLaTjXFIua0FY9up37riWrDlHyw45RGAg8kSxjB1knNrmEDgXuVql/zZ0DLJChIFQo0epWvbl+RTFBpCcfGdAI/tVGOtWWE00m5mxmaYjLCA9pxVGJBTZTPrp2gU6f0UaK0K2nRTP09kWNhzFjErlNgOzSL3lT8z+tkNrmKcibTzFJJ5ouSjCOr0PR11GeaEsvHjmCimbsVkSHWmFgXUNmFECy+vEzC89p1zb+/qNZvijRKcAwncAYBXEId7qABIRB4hGd4hTdPeS/eu/cxb13xipkj+APv8wcIeY72</latexit><latexit sha1_base64="PHtNjW6na207435B/B6JIWe5ANM=">AAAB7HicbVDLSgNBEOz1GeMr6tHLYBA8hV0R1FvQi8cIbhJIljA7mU3GzGOZmRXCkn/w4kHFqx/kzb9xkuxBEwsaiqpuurvilDNjff/bW1ldW9/YLG2Vt3d29/YrB4dNozJNaEgUV7odY0M5kzS0zHLaTjXFIua0FY9up37riWrDlHyw45RGAg8kSxjB1knNrmEDgXuVql/zZ0DLJChIFQo0epWvbl+RTFBpCcfGdAI/tVGOtWWE00m5mxmaYjLCA9pxVGJBTZTPrp2gU6f0UaK0K2nRTP09kWNhzFjErlNgOzSL3lT8z+tkNrmKcibTzFJJ5ouSjCOr0PR11GeaEsvHjmCimbsVkSHWmFgXUNmFECy+vEzC89p1zb+/qNZvijRKcAwncAYBXEId7qABIRB4hGd4hTdPeS/eu/cxb13xipkj+APv8wcIeY72</latexit><latexit sha1_base64="PHtNjW6na207435B/B6JIWe5ANM=">AAAB7HicbVDLSgNBEOz1GeMr6tHLYBA8hV0R1FvQi8cIbhJIljA7mU3GzGOZmRXCkn/w4kHFqx/kzb9xkuxBEwsaiqpuurvilDNjff/bW1ldW9/YLG2Vt3d29/YrB4dNozJNaEgUV7odY0M5kzS0zHLaTjXFIua0FY9up37riWrDlHyw45RGAg8kSxjB1knNrmEDgXuVql/zZ0DLJChIFQo0epWvbl+RTFBpCcfGdAI/tVGOtWWE00m5mxmaYjLCA9pxVGJBTZTPrp2gU6f0UaK0K2nRTP09kWNhzFjErlNgOzSL3lT8z+tkNrmKcibTzFJJ5ouSjCOr0PR11GeaEsvHjmCimbsVkSHWmFgXUNmFECy+vEzC89p1zb+/qNZvijRKcAwncAYBXEId7qABIRB4hGd4hTdPeS/eu/cxb13xipkj+APv8wcIeY72</latexit>+

+

60256 256 256 256 256 256 256 256

60

24

256 128 128 128 128

Layer Types

Fully-connected, ReLU activation

Fully-connected, no activation

Fully-connected, sigmoid activation

Fig. 7: A visualization of our fully-connected network architecture. Input vectors are shown in green, intermediate hidden layers are shown in blue, output vectors are shown in red, and the number inside each block signifies the vector's dimension. All layers are standard fully-connected layers, black arrows indicate layers with ReLU activations, orange arrows indicate layers with no activation, dashed arrows indicate layers with sigmoid activation, and "+" denotes vector concatenation. The positional encoding of the input location (γ(x)) is passed through 8 fully-connected ReLU layers, each with 256 channels. We follow the DeepSDF [23] architecture and include a skip connection that concatenates this input to the fifth layer's activation. An additional layer outputs the volume density σ (which is rectified using a ReLU to ensure that the output volume density is nonnegative) and a 256-dimensional feature vector. This feature vector is concatenated with the positional encoding of the input viewing direction (γ(d)), and is processed by 4 additional fully-connected ReLU layers, each with 128 channels. A final layer (with a sigmoid activation) outputs the emitted RGB radiance at position x, as viewed by a ray with direction d.
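
For concreteness, here is a minimal sketch of the MLP described in this caption, written as a PyTorch module. The layer widths, skip connection, and output heads follow the caption above; the framework choice, the names NeRFMLP and positional_encoding, and the two-head split for σ and the feature vector are our own assumptions for illustration, not the authors' released implementation.

```python
import math
import torch
import torch.nn as nn

def positional_encoding(p, num_freqs):
    """gamma(p): sin/cos features at frequencies 2^k * pi, k = 0..num_freqs-1.
    With num_freqs = 10 a 3-vector maps to 60 channels; with 4, to 24."""
    feats = []
    for k in range(num_freqs):
        feats.append(torch.sin((2.0 ** k) * math.pi * p))
        feats.append(torch.cos((2.0 ** k) * math.pi * p))
    return torch.cat(feats, dim=-1)

class NeRFMLP(nn.Module):
    def __init__(self, x_freqs=10, d_freqs=4, width=256, view_width=128):
        super().__init__()
        in_x = 3 * 2 * x_freqs  # 60-dimensional gamma(x)
        in_d = 3 * 2 * d_freqs  # 24-dimensional gamma(d)
        self.x_freqs, self.d_freqs = x_freqs, d_freqs
        # Eight 256-channel ReLU layers; gamma(x) is re-injected after the fifth.
        self.before_skip = nn.ModuleList(
            [nn.Linear(in_x, width)] + [nn.Linear(width, width) for _ in range(4)])
        self.after_skip = nn.ModuleList(
            [nn.Linear(width + in_x, width)] + [nn.Linear(width, width) for _ in range(2)])
        self.sigma_head = nn.Linear(width, 1)        # volume density (ReLU-rectified)
        self.feature_head = nn.Linear(width, width)  # 256-d feature, no activation
        # View-dependent branch: four 128-channel ReLU layers, then sigmoid RGB.
        self.view_layers = nn.ModuleList(
            [nn.Linear(width + in_d, view_width)] +
            [nn.Linear(view_width, view_width) for _ in range(3)])
        self.rgb_head = nn.Linear(view_width, 3)

    def forward(self, x, d):
        # x: (N, 3) sample positions, d: (N, 3) unit viewing directions.
        gx = positional_encoding(x, self.x_freqs)
        gd = positional_encoding(d, self.d_freqs)
        h = gx
        for layer in self.before_skip:
            h = torch.relu(layer(h))
        h = torch.cat([h, gx], dim=-1)               # skip connection
        for layer in self.after_skip:
            h = torch.relu(layer(h))
        sigma = torch.relu(self.sigma_head(h))       # nonnegative density
        v = torch.cat([self.feature_head(h), gd], dim=-1)
        for layer in self.view_layers:
            v = torch.relu(layer(v))
        rgb = torch.sigmoid(self.rgb_head(v))        # emitted RGB radiance in [0, 1]
        return rgb, sigma
```

Calling NeRFMLP()(x, d) on batches of sampled 3D points and their viewing directions returns the per-sample color and density that the volume rendering step composites along each ray.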

Quantitative Comparisons The SRN implementation published by the authors requires a significant amount of GPU memory, and is limited to an image resolution of 512 × 512 pixels even when parallelized across 4 NVIDIA V100 GPUs. We compute quantitative metrics for SRN at 512 × 512 pixels for our synthetic datasets and 504 × 376 pixels for the real datasets, in comparison to 800 × 800 and 1008 × 752 respectively for the other methods that can be run at higher resolutions.
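
As a reference for how the per-image numbers in Tables 3-5 can be produced, the snippet below sketches a generic PSNR computation for a rendered view against its ground-truth image. The function name and the NumPy implementation are ours, not the authors' evaluation code, and images are assumed to be scaled to [0, 1].

```python
import numpy as np

def psnr(rendered, ground_truth, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, max_val]."""
    mse = np.mean((rendered.astype(np.float64) - ground_truth.astype(np.float64)) ** 2)
    return 10.0 * np.log10((max_val ** 2) / mse)
```

Per-scene table entries can then be obtained by averaging such per-image values over a scene's test views; SSIM and LPIPS follow their respective standard definitions, which we do not reproduce here.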

C Additional Results

Per-scene breakdown Tables 3, 4, and 5 include a breakdown of the quantitative results presented in the main paper into per-scene metrics. The per-scene breakdown is consistent with the aggregate quantitative metrics presented in the paper, where our method quantitatively outperforms all baselines. Although LLFF achieves slightly better LPIPS metrics, we urge readers to view our supplementary video, where our method achieves better multiview consistency and produces fewer artifacts than all baselines.


[Fig. 8 image grid: rows show the Pedestal and Cube scenes; columns show Ground Truth, NeRF (ours), LLFF [20], SRN [30], and NV [16].]

Fig. 8: Comparisons on test-set views for scenes from the DeepVoxels [29] synthetic dataset. The objects in this dataset have simple geometry and perfectly diffuse reflectance. Because of the large number of input images (479 views) and the simplicity of the rendered objects, both our method and LLFF [20] perform nearly perfectly on this data. LLFF still occasionally presents artifacts when interpolating between its 3D volumes, as in the top inset for each object. SRN [30] and NV [16] do not have the representational power to render fine details.

PSNR↑              Chair   Pedestal  Cube    Vase
DeepVoxels [29]    33.45   32.35     28.42   27.99
SRN [30]           36.67   35.91     28.74   31.46
NV [16]            35.15   36.47     26.48   20.39
LLFF [20]          36.11   35.87     32.58   32.97
Ours               42.65   41.44     39.19   37.32

SSIM↑              Chair   Pedestal  Cube    Vase
DeepVoxels [29]    0.99    0.97      0.97    0.96
SRN [30]           0.992   0.986     0.976   0.991
NV [16]            0.991   0.989     0.964   0.841
LLFF [20]          0.996   0.994     0.994   0.995
Ours               0.997   0.997     0.999   0.998

LPIPS↓             Chair   Pedestal  Cube    Vase
DeepVoxels [29]    −       −         −       −
SRN [30]           0.093   0.081     0.074   0.044
NV [16]            0.096   0.069     0.113   0.117
LLFF [20]          0.051   0.039     0.064   0.039
Ours               0.047   0.024     0.006   0.017

Table 3: Per-scene quantitative results from the DeepVoxels [29] dataset. The "scenes" in this dataset are all diffuse objects with simple geometry, rendered from texture-mapped meshes captured by a 3D scanner. The metrics for the DeepVoxels method are taken directly from their paper, which does not report LPIPS and only reports two significant figures for SSIM.


PSNR↑       Chair   Drums   Ficus   Hotdog  Lego    Materials  Mic     Ship
SRN [30]    26.96   17.18   20.73   26.81   20.85   18.09      26.85   20.60
NV [16]     28.33   22.58   24.79   30.71   26.08   24.22      27.78   23.93
LLFF [20]   28.72   21.13   21.79   31.41   24.54   20.72      27.48   23.22
Ours        33.00   25.01   30.13   36.18   32.54   29.62      32.91   28.65

SSIM↑       Chair   Drums   Ficus   Hotdog  Lego    Materials  Mic     Ship
SRN [30]    0.954   0.724   0.857   0.945   0.870   0.846      0.974   0.763
NV [16]     0.972   0.926   0.949   0.974   0.947   0.947      0.976   0.862
LLFF [20]   0.976   0.905   0.903   0.980   0.945   0.909      0.979   0.885
Ours        0.992   0.956   0.985   0.991   0.988   0.982      0.992   0.928

LPIPS↓      Chair   Drums   Ficus   Hotdog  Lego    Materials  Mic     Ship
SRN [30]    0.106   0.267   0.149   0.100   0.200   0.174      0.063   0.299
NV [16]     0.109   0.214   0.162   0.109   0.175   0.130      0.107   0.276
LLFF [20]   0.064   0.126   0.130   0.061   0.110   0.117      0.084   0.218
Ours        0.046   0.091   0.044   0.121   0.050   0.063      0.028   0.206

Table 4: Per-scene quantitative results from our realistic synthetic dataset. The "scenes" in this dataset are all objects with more complex geometry and non-Lambertian materials, rendered using Blender's Cycles pathtracer.
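
As a quick consistency check (our arithmetic, not a figure quoted from this appendix), averaging the "Ours" PSNR row of Table 4 over its eight scenes gives (33.00 + 25.01 + 30.13 + 36.18 + 32.54 + 29.62 + 32.91 + 28.65) / 8 ≈ 31.01, which matches the kind of dataset-level aggregate reported in the main paper that this per-scene breakdown expands.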


PSNR↑       Room    Fern    Leaves  Fortress  Orchids  Flower  T-Rex   Horns
SRN [30]    27.29   21.37   18.24   26.63     17.37    24.63   22.87   24.33
LLFF [20]   28.42   22.85   19.52   29.40     18.52    25.46   24.15   24.70
Ours        32.70   25.17   20.92   31.16     20.36    27.40   26.80   27.45

SSIM↑       Room    Fern    Leaves  Fortress  Orchids  Flower  T-Rex   Horns
SRN [30]    0.950   0.822   0.822   0.838     0.746    0.916   0.916   0.921
LLFF [20]   0.963   0.887   0.877   0.957     0.775    0.935   0.935   0.941
Ours        0.978   0.932   0.904   0.962     0.852    0.941   0.960   0.951

LPIPS↓      Room    Fern    Leaves  Fortress  Orchids  Flower  T-Rex   Horns
SRN [30]    0.240   0.459   0.440   0.453     0.467    0.288   0.298   0.376
LLFF [20]   0.155   0.247   0.216   0.173     0.313    0.174   0.222   0.193
Ours        0.178   0.280   0.316   0.171     0.321    0.219   0.249   0.268

Table 5: Per-scene quantitative results from our real image dataset. The scenes in this dataset are all captured with a forward-facing handheld cellphone.