Project 5: Neural Radiance Fields
Part 1: Fit a Neural Field to a 2D Image
To fit a Neural Field to a 2D image, I construct an MLP with the following architecture:
For positional encodings, I use the sine/cosine encodings from the NeRF paper
with L = 10. The full input is a (10*2 + 1)*2 = 42-dimensional vector.
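A minimal sketch of this encoding (assuming PyTorch and 2D coordinates; the exact tensor layout is an assumption):

```python
import torch

def positional_encoding(x, L=10):
    """Sinusoidal positional encoding from the NeRF paper.

    x: (..., D) tensor of coordinates (D = 2 for pixel locations).
    Returns a (..., D * (2L + 1)) tensor: the raw coordinates plus
    sin/cos at L frequencies. For D = 2 and L = 10 this is 42 dims.
    """
    out = [x]
    for i in range(L):
        out.append(torch.sin((2.0 ** i) * torch.pi * x))
        out.append(torch.cos((2.0 ** i) * torch.pi * x))
    return torch.cat(out, dim=-1)
```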
I train the network with a learning rate of 0.01 and a batch size of 10,000 for 3000 epochs.
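The layer widths and depth below are placeholders rather than the exact architecture in the figure, but using the encoding above, a minimal training sketch looks like this:

```python
import torch
import torch.nn as nn

# Placeholder architecture: Linear + ReLU layers mapping the 42-dim encoded
# coordinate to an RGB value in [0, 1] via a final Sigmoid.
model = nn.Sequential(
    nn.Linear(42, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 3), nn.Sigmoid(),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

# coords: (H*W, 2) pixel coordinates normalized to [0, 1]
# colors: (H*W, 3) ground-truth RGB values in [0, 1]
def train_step(coords, colors, batch_size=10_000):
    idx = torch.randint(0, coords.shape[0], (batch_size,))
    pred = model(positional_encoding(coords[idx], L=10))
    loss = nn.functional.mse_loss(pred, colors[idx])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```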
L = 10 Results
These are the best results. With L=10 and the network as described above, the reconstructed fox is in sharp focus.
L = 0 Results
With L = 0, there are no positional encodings. The network has a lot of trouble figuring out the geometry of the image,
and struggles to reconstruct fine details such as sharp edges.
L = 10 with 2 extra hidden layers Results
Curiously, the deeper network has more trouble learning color, but learns the shape of the fox quickly.
L = 10 on Neuschwanstein
Neuschwanstein is a much harder image to memorize because it is filled with intricate details. However, this network
still achieves a reasonable reconstruction of the image.
Original:
Reconstruction training:
Part 2: Fit a Neural Radiance Field from Multi-view Images
Part 2.1: Create Rays from Cameras
Before building the NeRF, some machinery is needed to convert between coordinate systems. To convert from world to camera coordinates,
we define an extrinsic matrix composed of a rotation \(\mathbf{R}\) and a translation \(\mathbf{t}\):
\begin{align} \begin{bmatrix} x_c \\ y_c \\ z_c \\ 1 \end{bmatrix} = \begin{bmatrix} \mathbf{R}_{3\times3} &
\mathbf{t} \\ \mathbf{0}_{1\times3} & 1 \end{bmatrix} \begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix} \end{align}
To invert this transformation and convert from camera to world (c2w), I multiply the inverse of the extrinsic matrix by the camera
coordinates.
\begin{align} \begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix} = \begin{bmatrix} \mathbf{R}_{3\times3} &
\mathbf{t} \\ \mathbf{0}_{1\times3} & 1 \end{bmatrix}^{-1} \begin{bmatrix} x_c \\ y_c \\ z_c \\ 1 \end{bmatrix} \end{align}
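Roughly, the camera-to-world transform can be implemented in batched form like this (the function name and shapes are just for illustration):

```python
import torch

def transform(c2w, x_c):
    """Apply 4x4 camera-to-world matrices to points in camera coordinates.

    c2w: (N, 4, 4) camera-to-world matrices (the inverted extrinsics above).
    x_c: (N, 3) points in camera coordinates.
    Returns (N, 3) points in world coordinates.
    """
    ones = torch.ones_like(x_c[..., :1])
    x_h = torch.cat([x_c, ones], dim=-1)                 # homogeneous coords, (N, 4)
    x_w = torch.bmm(c2w, x_h.unsqueeze(-1)).squeeze(-1)  # (N, 4)
    return x_w[..., :3]
```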
To convert from camera coordinates to pixels, we first define the intrinsics matrix:
\begin{align}
\mathbf{K} = \begin{bmatrix} f_x & 0 & o_x \\ 0 & f_y & o_y \\ 0 & 0 & 1 \end{bmatrix} \end{align}
which converts between the coordinate systems as follows:
\begin{align} s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \mathbf{K} \begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} \end{align}
To convert from pixels to camera coordinates:
\begin{align} \mathbf{K}^{-1} s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} \end{align}
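As a sketch of this unprojection (assuming a scalar depth \(s\)):

```python
def pixel_to_camera(K, uv, s):
    """Unproject pixel coordinates to camera-frame points at depth s.

    K:  (3, 3) intrinsic matrix.
    uv: (N, 2) pixel coordinates.
    s:  scalar depth along the camera z-axis (s = z_c).
    Returns (N, 3) points in camera coordinates.
    """
    ones = torch.ones_like(uv[..., :1])
    uv_h = torch.cat([uv, ones], dim=-1)                          # (N, 3)
    x_c = (torch.linalg.inv(K) @ uv_h.unsqueeze(-1)).squeeze(-1)  # K^{-1} [u v 1]^T
    return x_c * s
```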
To convert from pixels to rays, we need to determine the camera location and the direction of the ray. The camera location follows from the
extrinsic parameters, or equivalently from applying the c2w matrix to the camera-frame origin:
\begin{align}
\mathbf{r}_o &= -\mathbf{R}_{3\times3}^{-1}\mathbf{t} \\
\begin{bmatrix} \mathbf{r}_o \\ 1 \end{bmatrix} &= \begin{bmatrix} \mathbf{R}_{3\times3} &
\mathbf{t} \\ \mathbf{0}_{1\times3} & 1 \end{bmatrix}^{-1}
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}
\end{align}
The normalized ray direction is then calculated:
\begin{align} \mathbf{r}_d = \frac{\mathbf{X_w} - \mathbf{r}_o}{||\mathbf{X_w} -
\mathbf{r}_o||_2} \end{align}
To obtain \(\mathbf{X}_w\), I use the previously defined functions to convert from pixel coordinates to camera coordinates, and then from camera
coordinates to world coordinates.
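Putting the helpers sketched above together (unprojecting to any fixed depth works here, since the direction is normalized):

```python
def pixel_to_ray(K, c2w, uv):
    """Compute ray origins and normalized directions for a batch of pixels.

    K: (3, 3) intrinsics, c2w: (N, 4, 4) camera-to-world, uv: (N, 2) pixels.
    Returns (ray_o, ray_d), each of shape (N, 3), in world coordinates.
    """
    # Camera origin in world coordinates: translation column of c2w,
    # which equals -R^{-1} t from the world-to-camera extrinsics.
    ray_o = c2w[:, :3, 3]
    # Unproject the pixel to depth s = 1 in the camera frame, then to world.
    x_w = transform(c2w, pixel_to_camera(K, uv, s=1.0))
    ray_d = (x_w - ray_o) / torch.linalg.norm(x_w - ray_o, dim=-1, keepdim=True)
    return ray_o, ray_d
```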
Part 2.2: Sampling
To sample rays for training, I randomly select pixels from all of the training images and compute the ray corresponding to each.
To sample along a ray, I take 32 evenly spaced points between 2.0 and 6.0 units from the ray origin. During training,
I randomly perturb each sample to lie anywhere within the small interval around it. This helps avoid aliasing.
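A sketch of the point sampling (with the perturbation applied only during training):

```python
def sample_along_rays(ray_o, ray_d, n_samples=32, near=2.0, far=6.0, perturb=True):
    """Sample 3D points along each ray between the near and far bounds.

    ray_o, ray_d: (N, 3) ray origins and directions.
    Returns points of shape (N, n_samples, 3) in world coordinates.
    """
    t = torch.linspace(near, far, n_samples)                  # (n_samples,)
    if perturb:
        # Jitter each sample within its interval to avoid fixed sample depths.
        width = (far - near) / n_samples
        t = t + torch.rand(ray_o.shape[0], n_samples) * width
    else:
        t = t.expand(ray_o.shape[0], n_samples)
    return ray_o[:, None, :] + ray_d[:, None, :] * t[..., None]
```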
Part 2.3: Putting the Dataloading All Together
Part 2.4: Neural Radiance Field
I use the following architecture for the NeRF. In the 3D setting, we input both the coordinates of the points sampled along the ray
and the direction of the ray. The network then outputs a density and an RGB color for each point.
The point coordinates \(\mathbf{x}\) are positionally encoded with L = 10, and the ray directions are positionally
encoded with L = 4.
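The exact widths and depth from the figure are not reproduced here, but the structure is roughly as follows (a sketch with placeholder layer sizes):

```python
import torch
import torch.nn as nn

class NeRFMLP(nn.Module):
    """Sketch of the radiance field network; layer sizes are placeholders."""
    def __init__(self, x_dim=63, d_dim=27, hidden=256):
        # x_dim = 3 * (2*10 + 1) for L = 10 on positions,
        # d_dim = 3 * (2*4 + 1)  for L = 4 on directions.
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(x_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Sequential(nn.Linear(hidden, 1), nn.ReLU())
        self.color_head = nn.Sequential(
            nn.Linear(hidden + d_dim, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),
        )

    def forward(self, x_enc, d_enc):
        feat = self.trunk(x_enc)
        sigma = self.density_head(feat)                           # non-negative density
        rgb = self.color_head(torch.cat([feat, d_enc], dim=-1))   # RGB in [0, 1]
        return sigma, rgb
```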
Part 2.5: Volume Rendering
To render the pixel, I use the discrete volume rendering equation:
\begin{align}
\hat{C}(\mathbf{r})=\sum_{i=1}^N T_i\left(1-\exp \left(-\sigma_i \delta_i\right)\right) \mathbf{c}_i, \text { where } T_i=\exp
\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right) \end{align}
where \(T_i\) is the transmittance, \(\sigma_i\) the density, \(\delta_i\) the sample interval, and \(\mathbf{c}_i\) the
color at sample \(i\).
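In code, this sum can be computed with a cumulative product over the samples (a sketch, assuming a constant sample spacing):

```python
def volrend(sigmas, rgbs, step_size):
    """Discrete volume rendering of per-sample densities and colors.

    sigmas: (N, n_samples, 1), rgbs: (N, n_samples, 3),
    step_size: the sample spacing delta_i (constant here).
    Returns rendered pixel colors of shape (N, 3).
    """
    alphas = 1.0 - torch.exp(-sigmas * step_size)   # 1 - exp(-sigma_i * delta_i)
    # T_i = exp(-sum_{j<i} sigma_j * delta_j) = prod_{j<i} (1 - alpha_j)
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=1)
    weights = trans * alphas                        # (N, n_samples, 1)
    return (weights * rgbs).sum(dim=1)              # (N, 3)
```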
During training, I first sample a set of rays, sample points along those rays, and feed the points through the network.
The network returns densities and RGB colors, which I pass through the volume rendering equation above to obtain the predicted pixel color.
This prediction is supervised with an MSE loss against the ground-truth pixel.
The batch size is 10,000 and the learning rate is 0.0005 with the Adam optimizer. The results below were trained for 10,000 epochs.
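Tying the pieces together, one training iteration looks roughly like this (a sketch reusing the helpers sketched above):

```python
def nerf_train_step(rays_o, rays_d, target_rgb, model, optimizer):
    """One iteration: sample points, query the network, render, supervise."""
    pts = sample_along_rays(rays_o, rays_d)                      # (N, 32, 3)
    dirs = rays_d[:, None, :].expand(pts.shape)                  # (N, 32, 3)
    sigmas, rgbs = model(positional_encoding(pts, L=10),
                         positional_encoding(dirs, L=4))
    pred = volrend(sigmas, rgbs, step_size=(6.0 - 2.0) / 32)
    loss = torch.nn.functional.mse_loss(pred, target_rgb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```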
Bells and Whistles
Depth Rendering
It is possible to extract a depth map of the scene from the trained NeRF. To do this, I modify the volume rendering equation
to accumulate distance along the ray instead of color:
\begin{align}
\hat{D}(\mathbf{r})=\sum_{i=1}^N T_i\left(1-\exp \left(-\sigma_i \delta_i\right)\right) d_i, \text { where } d_i=\sum_{j=1}^{i-1} \delta_j \end{align}
where \(d_i\) is the distance of sample \(i\) along the ray. I then normalize the resulting depths to lie between 0 and 1 for display.
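A sketch of this depth variant of the renderer:

```python
def volrend_depth(sigmas, step_size, n_samples=32):
    """Render a per-ray depth by replacing the color c_i with the distance d_i."""
    alphas = 1.0 - torch.exp(-sigmas * step_size)
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=1)
    weights = trans * alphas                                     # (N, n_samples, 1)
    dists = step_size * torch.arange(n_samples).view(1, -1, 1)   # d_i along the ray
    depth = (weights * dists).sum(dim=1)                         # (N, 1)
    return (depth - depth.min()) / (depth.max() - depth.min())   # normalize to [0, 1]
```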
Changing Background Color
It is also possible to change the background color of the rendered images by modifying the volrend function. To do so,
I modify \(\hat{C}(\mathbf{r})\) as follows:
\begin{align}
\hat{C}^{'}(\mathbf{r}) = \hat{C}(\mathbf{r}) + T_{N+1}\cdot \mathbf{c_{bg}}
\end{align}
\(\mathbf{c_{bg}}\) represents the chosen background color, and \(T_{N+1}\) is the probability of the ray not terminating
at any of the sample points along the ray, in which case it hits the background. Here is the scene with various
colored backgrounds after 3000 epochs.
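A sketch of the modified renderer with a background color:

```python
def volrend_bg(sigmas, rgbs, step_size, bg_color):
    """Volume rendering with a configurable background color.

    bg_color: (3,) tensor, weighted by the leftover transmittance T_{N+1}.
    """
    alphas = 1.0 - torch.exp(-sigmas * step_size)
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=1)
    # T_{N+1}: probability that the ray passes all samples without terminating.
    t_final = trans[:, -1]                                       # (N, 1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=1)
    color = (trans * alphas * rgbs).sum(dim=1)                   # \hat{C}(r), (N, 3)
    return color + t_final * bg_color                            # \hat{C}'(r)
```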
One quirk I noticed during training was that, past a certain point (around 5000 epochs), the NeRF began to learn
a black floor beneath the truck. This becomes apparent when the background is rendered in a different color: