NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis

\(\newcommand{\vec}[1]{\mathbf{#1}} \\ \newcommand{\real}[1]{\mathbb{R}^{#1}} \\ \newcommand{\rvec}[1]{\left[#1 \right]} \\ \newcommand{\func}[2]{#1\left(#2\right)} \\ \newcommand{\rt}{\func{\rr}{t}} \newcommand{\rr}{\mathbf{r}} \newcommand{\clr}{\func{c}{\rt, \mathbf{d}}} \\ \newcommand{\Clr}{\func{C}{\rr}} \\ \newcommand{\hatClr}{\func{\hat{C}}{\rr}} \\ \newcommand{\hatClri}[1]{\func{\hat{C}_{#1}}{\rr}} \\ \newcommand{\voldensityr}[1]{\func{\sigma}{\func{\rr}{#1}}}\)

Neural Radiance Field Representation

  • A scene is a function $f: \vec{x} \rightarrow \vec{y}, \vec{x} \in \real{5}, \vec{y} \in \real{4}$.

  • $\vec{x}$ consists of 3D coordinates $\rvec{x, y, z}$ and a viewing direction $\rvec{\phi, \theta}$.

  • $\vec{y}$ consists of a colour $\rvec{r, g, b}$ and a volume density $\sigma$.

  • $f$ is approximated by an MLP as follows:

    \[\rvec{\vec{s}, \sigma} = f_1(\rvec{x, y, z})\\ \rvec{r, g, b} = f_2(\rvec{\phi, \theta, \vec{s}})\\ \vec{y} = \rvec{r, g, b, \sigma}\]
  • Here $\vec{s}$ is an intermediate feature predicted by $f_1$.

  • This makes the colour of a point depend on both its 3D location and the viewing direction, whilst the volume density is invariant to viewing direction (see the sketch after this list).

  • $\func{\sigma}{\mathbf{x}}d\mathbf{x}$ can be interpreted as the probability that a ray of light terminates at an infinitesimal particle at location $\mathbf{x}$.
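
To make the two-stage split concrete, here is a minimal PyTorch sketch. The layer sizes are illustrative only (the paper's $f_1$ is an 8-layer, 256-unit MLP with a skip connection and its $f_2$ a single 128-unit layer), and in practice the inputs would first be passed through the positional encoding described under Architecture below:

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Sketch of the two-stage MLP: f1 maps a 3D location to (sigma, feature s),
    f2 maps (s, viewing direction) to an RGB colour."""

    def __init__(self, pos_dim=3, dir_dim=3, hidden=128, feat_dim=128):
        super().__init__()
        # f1: location -> (volume density sigma, intermediate feature s)
        self.f1 = nn.Sequential(
            nn.Linear(pos_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim + 1),       # last unit is sigma
        )
        # f2: (feature s, viewing direction) -> RGB
        self.f2 = nn.Sequential(
            nn.Linear(feat_dim + dir_dim, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3),
        )

    def forward(self, x, d):
        h = self.f1(x)
        sigma = torch.relu(h[..., 0])              # density is non-negative and direction-independent
        s = h[..., 1:]                             # intermediate feature
        rgb = torch.sigmoid(self.f2(torch.cat([s, d], dim=-1)))   # colour depends on direction too
        return rgb, sigma
```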

Volume Rendering with Radiance Fields

  • The expected colour of a ray $\rr$ is a weighted integral of the colour at points along the ray between a near and a far bound ($t_n$ and $t_f$).

  • A ray $\rt$ can be written in terms of an origin $\mathbf{o}$ and a direction $\mathbf{d}$ as $\rt = \mathbf{o} + t\mathbf{d}$.

  • The colour can then be defined as follows:

\[\Clr = \int_{t_n}^{t_f}{ \func{T}{t} \, \voldensityr{t} \, \clr \, dt } \\ \func{T}{t} = \func{\exp}{ -\int_{t_n}^{t}{ \voldensityr{s} \, ds } }\]
  • The weighting function $\func{T}{t}$ is the accumulated transmittance along the ray up to $t$, starting from $t_n$.

  • As noted earlier, the density depends only on the location $\rt$, whereas the colour varies with the viewing direction as well.

  • The integral is approximated as follows:

    • Divide the region between $t_n$ and $t_f$ into $N$ evenly-sized sections
    • Sample a point uniformly at random from each section (stratified sampling)
    • Evaluate $\clr$ and $\voldensityr{t}$ at these points
    • Use these values to estimate $\Clr$ as follows (see the code sketch after this list):
    \[\hatClr = \sum_{i=1}^{N} \func{\hat{T}}{t_i} \left( 1 - \func{\exp}{ -\voldensityr{t_i}\delta_i } \right) \func{c}{\func{\rr}{t_i}, \mathbf{d}} \\ \func{\hat{T}}{t_{i}} = \func{\exp}{ -\sum_{i'=1}^{i-1}{ {\voldensityr{t_{i'}}\delta_{i'}} } } \\ \delta_i = t_{i+1} - t_i\]
    • They don’t say how $t_{i+1}$ is defined for $i=N$ though.
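
A minimal PyTorch sketch of this procedure (the function names and batching conventions are mine, not the paper's). It combines the stratified sampling of the $t_i$ with the compositing sum above; a common convention, adopted here, is to treat the final interval $\delta_N$ as extending to infinity, which is one way of answering the $t_{N+1}$ question in the last bullet:

```python
import torch

def stratified_samples(t_near, t_far, n_samples, batch_shape=()):
    """Split [t_near, t_far] into n_samples equal sections and draw one uniform
    sample from each section (the stratified sampling described above)."""
    edges = torch.linspace(t_near, t_far, n_samples + 1)
    lower, upper = edges[:-1], edges[1:]
    u = torch.rand(*batch_shape, n_samples)
    return lower + (upper - lower) * u                        # (..., n_samples)

def render_rays(rgb, sigma, t_vals):
    """Discrete volume rendering (the quadrature above).
    rgb:    (..., N, 3) colours at the sampled points
    sigma:  (..., N)    densities at the sampled points
    t_vals: (..., N)    sample distances along each ray
    Returns the estimated colour (..., 3) and the per-sample weights (..., N)."""
    # delta_i = t_{i+1} - t_i; there is no t_{N+1}, so the final interval is
    # padded with a very large value (treated as extending to infinity).
    deltas = t_vals[..., 1:] - t_vals[..., :-1]
    deltas = torch.cat([deltas, torch.full_like(deltas[..., :1], 1e10)], dim=-1)

    alpha = 1.0 - torch.exp(-sigma * deltas)                  # 1 - exp(-sigma_i * delta_i)
    # T_i = exp(-sum_{i'<i} sigma_i' * delta_i') = prod_{i'<i} (1 - alpha_i')
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)
    trans = torch.cat([torch.ones_like(trans[..., :1]), trans[..., :-1]], dim=-1)

    weights = trans * alpha                                   # per-sample weights
    colour = (weights[..., None] * rgb).sum(dim=-2)           # hat{C}(r)
    return colour, weights
```

Given rays $\rr = \mathbf{o} + t\mathbf{d}$, one evaluates the MLP at the points $\mathbf{o} + t_i\mathbf{d}$ and passes the resulting colours and densities to `render_rays`; the returned `weights` are reused by the fine sampling described in the next section.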

Optimising a Neural Radiance Field

  • The sampling described above is referred to as coarse sampling.
  • We can interpret the normalised weights $w_i = \func{\hat{T}}{t_i} \left( 1 - \func{\exp}{ -\voldensityr{t_i}\delta_i } \right)$ as defining a piecewise-constant probability distribution along the ray.
  • The fine sampling is as follows:
    • Suppose that we sample $N_c$ points for the coarse estimate
    • Sample a further $N_f$ points from this distribution via inverse transform sampling (see the sketch after this list)
    • Evaluate the colour using both the previously sampled $N_c$ points and these $N_f$ points
  • The loss has components for both the coarse and fine colour estimates:
\[\sum_{\rr \in \mathcal{R}}{ \left[ \lVert \hatClri{c} - \Clr \rVert_2^2 + \lVert \hatClri{f} - \Clr \rVert_2^2 \right] }\]
  • At inference time only the fine estimate is used, but since the coarse sampling gives rise to the distribution used for fine sampling, we also train on the coarse estimates in order to improve this distribution.
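
A sketch of the fine sampling via inverse transform sampling from this piecewise-constant distribution (a mild simplification of the paper's procedure: the coarse sample locations are treated as bin edges and each bin's mass is taken from the weight at its left edge):

```python
import torch

def sample_fine(t_coarse, weights, n_fine):
    """Inverse transform sampling from the piecewise-constant distribution given
    by the coarse weights.
    t_coarse: (..., Nc) sorted coarse sample locations along each ray
    weights:  (..., Nc) coarse weights w_i = T_i * (1 - exp(-sigma_i * delta_i))
    Returns:  (..., n_fine) additional sample locations."""
    pdf = weights[..., :-1] + 1e-5                                   # avoid an all-zero PDF
    pdf = pdf / pdf.sum(dim=-1, keepdim=True)
    cdf = torch.cumsum(pdf, dim=-1)
    cdf = torch.cat([torch.zeros_like(cdf[..., :1]), cdf], dim=-1)   # (..., Nc), from 0 to 1

    u = torch.rand(*cdf.shape[:-1], n_fine)                          # uniform samples in [0, 1)
    hi = torch.searchsorted(cdf, u, right=True).clamp(max=cdf.shape[-1] - 1)
    lo = (hi - 1).clamp(min=0)

    # Linearly invert the CDF within the selected bin.
    cdf_lo, cdf_hi = torch.gather(cdf, -1, lo), torch.gather(cdf, -1, hi)
    t_lo, t_hi = torch.gather(t_coarse, -1, lo), torch.gather(t_coarse, -1, hi)
    frac = (u - cdf_lo) / (cdf_hi - cdf_lo + 1e-10)
    return t_lo + frac * (t_hi - t_lo)
```

The fine network is then evaluated at the sorted union of the $N_c$ coarse and $N_f$ fine samples, and both renderings are regressed against the ground-truth pixel colour as in the loss above.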

Architecture

  • The function $f$ is approximated by an MLP.
  • The neural network represents just one scene, so a new model has to be trained for each new scene.
  • An image corresponds to a grid of rays, each with the camera origin $\mathbf{o}$ and a per-pixel direction $\mathbf{d}$; to render the image you predict $\hatClr$ for each of these rays.
  • The inputs $p$ consist of each coordinate of a point $\rt$ on the ray (normalised to lie in $\rvec{-1, 1}$) and each coordinate of the unit direction vector $\mathbf{d}$.
  • Each of these is transformed into a $2L$-dimensional positional encoding of the form (see the sketch after this list):
\[\left( \sin(2^0\pi p), \cos(2^0\pi p), \ldots, \sin(2^{L-1}\pi p), \cos(2^{L-1}\pi p) \right)\]
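
A minimal sketch of this encoding, applied independently to every coordinate (the paper uses $L = 10$ for the location and $L = 4$ for the viewing direction; the ordering of the sin/cos terms below differs from the formula above, which is immaterial since the result feeds an MLP):

```python
import math
import torch

def positional_encoding(p, L):
    """Map each coordinate p to (sin(2^0 pi p), cos(2^0 pi p), ...,
    sin(2^{L-1} pi p), cos(2^{L-1} pi p)).
    p: (..., D) tensor of coordinates; returns (..., 2 * L * D)."""
    freqs = (2.0 ** torch.arange(L, dtype=torch.float32)) * math.pi  # 2^k * pi, k = 0..L-1
    angles = p[..., None] * freqs                                    # (..., D, L)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)  # (..., D, 2L)
    return enc.flatten(start_dim=-2)                                 # (..., 2 * L * D)

# Example usage with the paper's settings:
#   x_enc = positional_encoding(x, L=10)   # 3 location coords -> 60 dims
#   d_enc = positional_encoding(d, L=4)    # 3 direction coords -> 24 dims
```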

Experiments

  • They measure PSNR, SSIM and LPIPS for two kinds of artificially generated images (“Realistic” and “Diffuse”) and a set of real images.
  • For all the synthetic images they beat earlier models on all the metrics.
  • For the real images another model (LLFF) achieves a better LPIPS score, but they claim their approach produces qualitatively better results.

References