To condition the canonical neural field on the observed views, we extract three types of features: global
\( f_\textrm{glob} \), voxel-based \( f_\textrm{vox} \), and pixel-aligned \( f_\textrm{pix} \).
The three features have complementary strengths and address different key challenges: \( f_\textrm{vox} \)
can resolve occlusions, inject prior knowledge, and compensate for slight pose inaccuracies;
\( f_\textrm{glob} \) captures the overall characteristics and appearance of the subject through a flat
(1D) latent code, which further facilitates prior injection and the reconstruction of unobserved regions;
\( f_\textrm{pix} \), when available, provides direct, high-quality appearance information.
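For concreteness, the following sketch shows one way the three features can be gathered for a batch of 3D query points, given the feature volume, global latent code, and 2D feature maps produced by the pipeline described next: \( f_\textrm{vox} \) via trilinear interpolation into the volume, \( f_\textrm{glob} \) by broadcasting the latent code, and \( f_\textrm{pix} \) by projecting each point into the views and sampling the feature maps. This is a minimal PyTorch sketch; the helper \texttt{project\_fn}, the tensor shapes, and the averaging over views are assumptions rather than our exact implementation.

\begin{lstlisting}[language=Python]
import torch
import torch.nn.functional as F

def query_features(x, V, z, feat_maps, project_fn):
    """Gather the three conditioning features for query points x.

    x:          (N, 3) points in normalized volume coordinates, in [-1, 1]
    V:          (1, C_v, D, H, W) feature volume
    z:          (C_g,) global latent code
    feat_maps:  (T, C_p, h, w) per-view 2D feature maps F_t
    project_fn: maps (N, 3) points to (T, N, 2) normalized pixel coords
    """
    N = x.shape[0]

    # f_vox: trilinear interpolation into the feature volume
    grid = x.view(1, 1, 1, N, 3)                        # (1, 1, 1, N, 3)
    f_vox = F.grid_sample(V, grid, align_corners=True)  # (1, C_v, 1, 1, N)
    f_vox = f_vox.view(-1, N).t()                       # (N, C_v)

    # f_glob: the flat (1D) latent code, shared by all query points
    f_glob = z.expand(N, -1)                            # (N, C_g)

    # f_pix: project points into each view, sample the 2D feature maps
    uv = project_fn(x).unsqueeze(1)                     # (T, 1, N, 2)
    f_pix = F.grid_sample(feat_maps, uv, align_corners=True)  # (T, C_p, 1, N)
    f_pix = f_pix.squeeze(2).mean(dim=0).t()            # view average -> (N, C_p)

    return f_vox, f_glob, f_pix
\end{lstlisting}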
First, we extract the 2D feature maps \( F_t \) and pass them, together with initial heuristic motion
weights, through our VoluMorph module to obtain the final motion weights \( W \).
The features \( F_t \) and the motion weights \( W \) are then passed to a second VoluMorph module, which
outputs the volume \( V \) and a global latent code.
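The end-to-end wiring of these two stages is summarized in the schematic sketch below; the \texttt{encoder} placeholder, the two VoluMorph instances, and their call signatures are assumptions, not the actual module interfaces.

\begin{lstlisting}[language=Python]
import torch.nn as nn

class ReconstructionPipeline(nn.Module):
    """Schematic wiring of the two-stage feature extraction (signatures assumed)."""
    def __init__(self, encoder, volumorph_w, volumorph_v):
        super().__init__()
        self.encoder = encoder          # images -> 2D feature maps F_t
        self.volumorph_w = volumorph_w  # stage 1: refines the motion weights
        self.volumorph_v = volumorph_v  # stage 2: builds volume V + latent code

    def forward(self, images, w_init):
        # 2D feature maps F_t, one per observed view
        F_t = self.encoder(images)
        # Stage 1: initial heuristic motion weights -> final motion weights W
        W = self.volumorph_w(F_t, w_init)
        # Stage 2: feature maps + motion weights -> volume V and global code
        V, z = self.volumorph_v(F_t, W)
        return F_t, W, V, z
\end{lstlisting}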
Finally, we extract \( f_\textrm{vox} \), \( f_\textrm{glob} \), and \( f_\textrm{pix} \) and combine them
using the feature fusion module to condition the NeRF MLP.
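To make this last step concrete, the sketch below shows one simple instantiation in which the fusion is a small MLP over the concatenated features, and the NeRF MLP receives the fused feature alongside the positionally encoded coordinates; the concatenation-based design, layer sizes, and output parameterization are assumptions and may differ from our actual feature fusion module.

\begin{lstlisting}[language=Python]
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Hypothetical fusion: concatenate the three features, mix with an MLP."""
    def __init__(self, c_vox, c_glob, c_pix, c_out):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(c_vox + c_glob + c_pix, c_out), nn.ReLU(inplace=True),
            nn.Linear(c_out, c_out),
        )

    def forward(self, f_vox, f_glob, f_pix):
        return self.mlp(torch.cat([f_vox, f_glob, f_pix], dim=-1))  # (N, c_out)

class ConditionedNeRF(nn.Module):
    """Generic NeRF MLP conditioned on the fused feature."""
    def __init__(self, c_pos, c_feat, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(c_pos + c_feat, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 4),  # RGB (3) + density (1)
        )

    def forward(self, x_enc, f_fused):
        # x_enc: positionally encoded canonical coordinates, (N, c_pos)
        return self.net(torch.cat([x_enc, f_fused], dim=-1))
\end{lstlisting}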