Fast Training of Neural Lumigraph Representations using Meta Learning (Note)

Anna Chen
5 min read · Mar 20, 2022


Bergman, A., Kellnhofer, P., & Wetzstein, G. (2021). Fast training of neural lumigraph representations using meta learning. Advances in Neural Information Processing Systems, 34.

Introduction

Learning a 3D scene representation from partial observations captured by 2D images is a hot topic in machine learning, computer vision, and computer graphics. Such a representation can be used to reason about the scene or to render novel views.

This raises two questions: (1) how should the scene be parameterized? (2) how can the parameters be inferred efficiently from the observations?

Related Work

Image-based rendering (IBR)

Comparison of different IBR methods

Many existing approaches require proxy geometry to be estimated as a pre-processing step, which can take minutes to hours for a single scene.

Neural scene representation and rendering

Comparison of different neural scene representation and rendering methods

Recent works focus on data structures, network factorizations, pruning, importance sampling, fast integration, and other strategies to accelerate rendering speed; however, training time is still extremely slow.

Generalization with neural scene representation

Generalization is crucial for learning priors and for 3D generative adversarial networks.

Such approaches include conditioning by concatenation, hypernetworks, modulation or feature-wise linear transforms, and meta-learning.

Method

Overview of MetaNLR++, proposed by the authors

Image formation model

The authors define the image formation model as a learned pixel-wise aggregation of input image features, conditioned on the target viewing direction.

The encoder E and decoder D are implemented as resolution-preserving convolutional neural network (CNN) architectures. Using a CNN encoder and decoder increases the receptive field of the applied image loss, allowing more meaningful gradients to be propagated back into the shape representation.
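
As a rough illustration, a resolution-preserving encoder/decoder pair could look like the PyTorch sketch below. The layer counts, kernel sizes, and feature dimension d are assumptions for illustration, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    """Resolution-preserving CNN encoder: maps an H x W RGB image to an
    H x W x d feature map (no downsampling, so pixel alignment is kept)."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1),
        )

    def forward(self, img):       # img: (B, 3, H, W)
        return self.net(img)      # features: (B, d, H, W)

class ConvDecoder(nn.Module):
    """Resolution-preserving CNN decoder: maps an aggregated H x W x d
    feature map back to an RGB image at the target view."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_dim, 3, 3, padding=1),
        )

    def forward(self, feats):     # feats: (B, d, H, W)
        return self.net(feats)    # image: (B, 3, H, W)
```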

The feature aggregation function Γ operates on the surface of the shape representation Φ. To define the shape surface, they use a SIREN that represents the signed distance function (SDF) in 3D space, which encodes the object surface as its zero-level set.
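
A minimal SIREN-style SDF network is sketched below in PyTorch; the width, depth, and omega_0 frequency are illustrative defaults, and SIREN's specialized weight initialization is omitted for brevity.

```python
import torch
import torch.nn as nn

class SirenLayer(nn.Module):
    """Linear layer followed by a sine activation, as in SIREN."""
    def __init__(self, in_dim, out_dim, omega_0=30.0):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.omega_0 = omega_0

    def forward(self, x):
        return torch.sin(self.omega_0 * self.linear(x))

class SirenSDF(nn.Module):
    """Sine-activated MLP mapping a 3D point to a signed distance.
    The object surface is the zero-level set {x : Phi(x) = 0}."""
    def __init__(self, hidden=256, depth=4):
        super().__init__()
        dims = [3] + [hidden] * depth
        self.body = nn.Sequential(
            *[SirenLayer(dims[i], dims[i + 1]) for i in range(depth)]
        )
        self.head = nn.Linear(hidden, 1)    # signed distance output

    def forward(self, pts):                 # pts: (N, 3)
        return self.head(self.body(pts))    # signed distances: (N, 1)
```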

First, to check whether features are occluded, they sphere trace rays through each pixel to find surface points, and verify that the target-view surface position, when projected into each input view, lies at the same depth as the surface position found from that source view. Occluded features are discarded, and a re-sampled feature map is output for each input view.
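
The sphere tracing step can be sketched as follows, assuming an SDF callable like the SIREN above; this simplified version ignores divergent rays and the cross-view depth-consistency test described above.

```python
import torch

def sphere_trace(sdf, ray_o, ray_d, n_steps=50, eps=1e-4):
    """Minimal sphere tracer: march each ray forward by the signed distance
    until the surface (|sdf| < eps) is reached or the step budget runs out.

    sdf:   callable mapping (N, 3) points to (N, 1) signed distances
    ray_o: (N, 3) ray origins (camera centers)
    ray_d: (N, 3) unit ray directions (one per pixel)
    """
    t = torch.zeros_like(ray_o[..., :1])     # distance traveled per ray
    for _ in range(n_steps):
        pts = ray_o + t * ray_d              # current sample points
        d = sdf(pts)                         # distance to the surface
        t = t + d                            # safe step: move by that distance
        if (d.abs() < eps).all():            # all rays have converged
            break
    return ray_o + t * ray_d                 # estimated surface points
```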

Second, once the re-sampled feature maps have been generated, the aggregation function combines them into a single target feature map that can be processed by the decoder. This is done by a weighted average over the input views using the relative feature weights.
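
A minimal sketch of such a weighted aggregation is shown below; the softmax normalization and the way the per-view weights are produced are assumptions for illustration, not the paper's exact formulation.

```python
import torch

def aggregate_features(resampled_feats, weights):
    """Combine per-view re-sampled feature maps into one target feature map
    by a weighted average over the V input views.

    resampled_feats: (V, d, H, W) features from each input view, re-sampled
                     at the target view's surface points
    weights:         (V, 1, H, W) per-pixel relative weights (pre-softmax)
    """
    w = torch.softmax(weights, dim=0)          # normalize across views
    return (w * resampled_feats).sum(dim=0)    # target features: (d, H, W)
```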

Finally, the aggregation function produces an H×W×d feature map; feeding it into the decoder yields a synthesized view from the target viewpoint.

Supervision and training

Since NLR++ is end-to-end differentiable, the parameters ξ, ψ, θ, and ζ can be optimized end-to-end to reconstruct target views.

To train the model more efficiently, the authors carefully design the loss supervision schedule and batching strategy.

The image reconstruction loss is computed as a masked L1 loss on rendered images:
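
A standard masked L1 form (with M the object mask, \hat{I} the rendered image, and I the ground-truth target image) is:

\mathcal{L}_{\mathrm{RGB}} = \left\| M \odot ( \hat{I} - I ) \right\|_1

(the paper's exact normalization may differ)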

To quickly bootstrap the neural shape learning from the object mask, we apply a soft mask loss on the rendered image mask:
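
An IDR-style formulation is a binary cross-entropy between the ground-truth mask M and a soft rendered mask \hat{M} derived from the minimum SDF value along each ray (shown here as a sketch; the paper's exact soft-mask definition may differ):

\mathcal{L}_{\mathrm{mask}} = \mathrm{BCE}\left( M, \hat{M} \right), \quad \hat{M} = \sigma\!\left( -\alpha \min_t \Phi(r(t)) \right)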

Finally, the shape representation is regularized to model a valid SDF by enforcing the eikonal constraint on randomly sampled points p_i in 3D space:
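
\mathcal{L}_{\mathrm{eik}} = \mathbb{E}_{p_i} \left[ \left( \left\| \nabla_{p} \Phi(p_i) \right\|_2 - 1 \right)^2 \right]

(the standard eikonal regularizer: the SDF gradient should have unit norm everywhere)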

The total loss function combines these terms:
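
\mathcal{L} = \mathcal{L}_{\mathrm{RGB}} + \lambda_{\mathrm{mask}} \mathcal{L}_{\mathrm{mask}} + \lambda_{\mathrm{eik}} \mathcal{L}_{\mathrm{eik}}

(a weighted sum of the three terms above; the specific weights λ are not reproduced here)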

Generalization using meta learning

What is meta-learning?

Meta learning (Hung-Yi Lee)

This paper uses meta-learning to initialize the parameters of all networks in NLR++, using the Reptile algorithm to learn the initialization over repeated scene-fitting iterations.
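
A minimal Reptile-style sketch is shown below in PyTorch; scenes.sample() and scene.render_loss() are hypothetical placeholders for sampling a training scene and computing the NLR++ losses on it, and the step counts and learning rates are illustrative only.

```python
import copy
import torch

def reptile_meta_init(model, scenes, meta_steps=1000, inner_steps=32,
                      inner_lr=1e-4, meta_lr=0.1):
    """Reptile-style learning of a network initialization.

    Each meta-step: copy the current initialization, fit the copy to one
    scene for a few inner steps, then move the initialization a fraction
    of the way toward the scene-adapted weights.
    """
    for _ in range(meta_steps):
        scene = scenes.sample()                 # one training scene (placeholder API)
        adapted = copy.deepcopy(model)          # start from the current init
        opt = torch.optim.Adam(adapted.parameters(), lr=inner_lr)
        for _ in range(inner_steps):            # inner loop: fit this scene
            loss = scene.render_loss(adapted)   # e.g. masked L1 + mask + eikonal losses
            opt.zero_grad()
            loss.backward()
            opt.step()
        with torch.no_grad():                   # Reptile outer update
            for p, q in zip(model.parameters(), adapted.parameters()):
                p += meta_lr * (q - p)
    return model
```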

Experiments

Peak Signal-to-Noise Ratio (PSNR): measures image quality of a reconstructed or compressed image against a reference (a larger number means objectively higher quality).

Learned Perceptual Image Patch Similarity (LPIPS): evaluates the perceptual distance between image patches. Higher means more different; lower means more similar.
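
Both metrics are straightforward to compute; the sketch below assumes images as torch tensors in [0, 1] and uses the publicly available lpips package.

```python
import torch
import lpips  # pip install lpips

def psnr(img, ref, max_val=1.0):
    """Peak signal-to-noise ratio in dB; higher is better."""
    mse = torch.mean((img - ref) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

# LPIPS: learned perceptual distance; lower means more similar.
lpips_fn = lpips.LPIPS(net='alex')   # AlexNet-based variant
# inputs are (B, 3, H, W) tensors scaled to [-1, 1], e.g.:
# dist = lpips_fn(img * 2 - 1, ref * 2 - 1)
```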

PSNR, LPIPS, and real-time frame-quality comparison
Novel views synthesized by different methods

Discussion and Conclusion

(1) The paper proposes an efficient method for inferring 3D scene parameters from 2D image observations using meta-learning.

(2) MetaNLR++ reduces representation training time and renders at real-time rates without sacrificing image quality.

Limitation

(1) Object masks are required to supervise the ray-mask loss. While these can be computed automatically for some data, this poses a challenge in cluttered scenes or in applications that should generalize to arbitrary scenes.

(2) Memory consumption is a limitation, since the CNN feature encoder/decoder processes the entire image at once.

(3) The method opts to utilize the full capacity of the CNN feature processing, as learning a detailed neural shape is slower than modeling fine details with features.

Future work

(1) The experiments used known camera poses to reconstruct the shape. Future work on jointly optimizing camera poses together with the representation proposed in this paper is certainly possible, and would be a step toward general view synthesis.

(2) Exploring the trade-off between the capacity of the feature generation method and the quality of the recovered shape.
