HumMorph: Generalized Dynamic Human Neural Fields from Few Views

University of Edinburgh
[TL;DR] HumMorph is a feed-forward model that generates animatable human NeRFs from a few views in arbitrary poses. It is particularly useful when the human pose parameters need to be estimated directly from the observed views.

Observed poses estimated directly from each input view (shown in red).
The letter in parentheses indicates the input view(s): L = left, R = right, or B = both.

Abstract

We introduce HumMorph, a novel generalized approach to free-viewpoint rendering of dynamic human bodies with explicit pose control. HumMorph renders a human actor in any specified pose given a few observed views (starting from just one) in arbitrary poses. Our method enables fast inference as it relies only on feed-forward passes through the model. We first construct a coarse representation of the actor in the canonical T-pose, which combines visual features from individual partial observations and fills in missing information using learned prior knowledge. The coarse representation is complemented by fine-grained pixel-aligned features extracted directly from the observed views, which provide high-resolution appearance information. We show that HumMorph is competitive with the state of the art when only a single input view is available; however, it achieves significantly better visual quality given just two monocular observations.

Moreover, previous generalized methods assume access to accurate body shape and pose parameters obtained using synchronized multi-camera setups. In contrast, we consider a more practical scenario where these body parameters are noisily estimated directly from the observed views. Our experimental results demonstrate that our architecture is more robust to errors in the noisy parameters and clearly outperforms the state of the art in this setting.

Results with accurate body poses on HuMMan

The accurate poses are estimated using synchronized multi-view camera setups and provided by the dataset.
Numbers in parentheses indicate the range of observed views. Note that SHERF only accepts a single input view.

Results with accurate body poses on DNA-Rendering

The accurate poses are estimated using synchronized multi-view camera setups and provided by the dataset.
Numbers in parentheses indicate the range of observed views. Note that SHERF only accepts a single input view.

Results with estimated body poses on HuMMan

The poses are estimated directly from the input views using HybrIK (shown in red).
Numbers in parentheses indicate the range of observed views. Note that SHERF only accepts a single input view.

Results with estimated body poses on DNA-Rendering

The poses are estimated directly from the input views using HybrIK (shown in red).
Numbers in parentheses indicate the range of observed views. Note that SHERF only accepts a single input view.

Method Overview

To condition the canonical neural field on the observed views, we extract three types of features: global \( f_\textrm{glob} \), voxel-based \( f_\textrm{vox} \), and pixel-aligned \( f_\textrm{pix} \). The three features have complementary strengths and tackle different key challenges: \( f_\textrm{vox} \) can resolve occlusions, inject prior knowledge, and compensate for slight pose inaccuracies; \( f_\textrm{glob} \) captures the overall characteristics and appearance of the subject through a flat (1D) latent code, which further facilitates prior injection and reconstruction of unobserved regions; \( f_\textrm{pix} \), when available, provides direct, high-quality appearance information.
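
As a rough illustration, the following PyTorch sketch shows how the three feature types could jointly condition a NeRF MLP predicting colour and density at a canonical query point. The module name, feature dimensions, and network sizes are assumptions made for this example and do not describe the released implementation.

import torch
import torch.nn as nn

class FusedNeRFMLP(nn.Module):
    """Toy NeRF MLP conditioned on fused global, voxel, and pixel features."""
    def __init__(self, d_glob=256, d_vox=64, d_pix=64, d_hidden=256):
        super().__init__()
        d_in = 3 + d_glob + d_vox + d_pix  # canonical point + fused features
        self.mlp = nn.Sequential(
            nn.Linear(d_in, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, 4),  # RGB (3 channels) + density (1 channel)
        )

    def forward(self, x_canonical, f_glob, f_vox, f_pix):
        # x_canonical: (N, 3) query points in the canonical T-pose space
        # f_glob:      (1, d_glob) flat latent code, broadcast to every point
        # f_vox:       (N, d_vox) features sampled from the canonical volume
        # f_pix:       (N, d_pix) pixel-aligned features (zeros if unavailable)
        f_glob = f_glob.expand(x_canonical.shape[0], -1)
        h = torch.cat([x_canonical, f_glob, f_vox, f_pix], dim=-1)
        out = self.mlp(h)
        rgb = torch.sigmoid(out[:, :3])
        sigma = torch.relu(out[:, 3:])
        return rgb, sigma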

First, we extract the 2D feature maps \( F_t \), which we pass through our VoluMorph module using initial heuristic motion weights to get the final motion weights \( W \). The features \( F_t \) and motion weights \( W \) are passed to a second VoluMorph module, which outputs the volume \( V \) and a global latent code. Finally, we extract \( f_\textrm{vox}, f_\textrm{glob}, f_\textrm{pix} \) and combine them using the feature fusion module to condition the NeRF MLP.
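
The per-point feature lookup at the end of this pipeline could be sketched as follows: \( f_\textrm{vox} \) is trilinearly sampled from the canonical volume \( V \), and \( f_\textrm{pix} \) is bilinearly sampled from the 2D feature maps \( F_t \) at the query-point projections. The tensor shapes and the normalised-coordinate conventions passed to grid_sample are assumptions for illustration, not a description of the released code.

import torch
import torch.nn.functional as F

def sample_voxel_features(V, x_canonical):
    # V:           (1, C, D, H, W) canonical feature volume from VoluMorph
    # x_canonical: (N, 3) query points, normalised to [-1, 1] in volume space
    grid = x_canonical.view(1, 1, 1, -1, 3)              # (1, 1, 1, N, 3)
    f_vox = F.grid_sample(V, grid, align_corners=True)   # (1, C, 1, 1, N)
    return f_vox.view(V.shape[1], -1).t()                # (N, C)

def sample_pixel_features(F_t, uv):
    # F_t: (T, C, H, W) per-view 2D feature maps
    # uv:  (T, N, 2) projections of the query points into each view, in [-1, 1]
    grid = uv.unsqueeze(1)                                # (T, 1, N, 2)
    f_pix = F.grid_sample(F_t, grid, align_corners=True)  # (T, C, 1, N)
    return f_pix.squeeze(2).permute(0, 2, 1)              # (T, N, C), fused across views later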

The VoluMorph module

The global and voxel-based features are produced by our 3D VoluMorph encoding module, which lifts each observed view of the body into a partial canonical model and combines these into a complete canonical representation at a coarse level.

The initial step in VoluMorph is the unprojection of the 2D feature maps into 3D. We then align the initial feature volumes to the canonical pose with a volume undeformation operation. The aligned partial models (volumes) are combined into a single, complete model by a 3D U-Net-based convolutional network with attention-based aggregation between views.
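
As a hedged sketch of the undeformation step, the snippet below warps each voxel centre of the canonical grid into the observed pose and resamples the unprojected (posed) feature volume at the warped locations, yielding a feature volume aligned with the canonical T-pose. The use of linear blend skinning, the tensor shapes, and the coordinate normalisation are assumptions for illustration and are not taken from the released implementation.

import torch
import torch.nn.functional as F

def undeform_volume(V_posed, x_canonical, motion_weights, bone_transforms):
    # V_posed:         (1, C, D, H, W) feature volume unprojected in the observed pose
    # x_canonical:     (D*H*W, 3) canonical coordinates of the voxel centres
    # motion_weights:  (D*H*W, J) per-voxel weights over J joints
    # bone_transforms: (J, 4, 4) canonical-to-observed bone transformations
    ones = torch.ones_like(x_canonical[:, :1])
    x_h = torch.cat([x_canonical, ones], dim=-1)          # homogeneous coordinates
    # Blend the bone transforms per voxel (linear blend skinning).
    T = torch.einsum('nj,jab->nab', motion_weights, bone_transforms)
    x_posed = torch.einsum('nab,nb->na', T, x_h)[:, :3]
    # Resample the posed volume at the warped voxel positions (assumes x_posed
    # is normalised to [-1, 1] in the posed volume's coordinate frame).
    D, H, W = V_posed.shape[2:]
    grid = x_posed.view(1, D, H, W, 3)
    V_canonical = F.grid_sample(V_posed, grid, align_corners=True)
    return V_canonical  # (1, C, D, H, W), aligned with the canonical pose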

The key feature of this module is that it can learn a semantic understanding of the body and can therefore capture and inject prior knowledge as well as (to some extent) resolve occlusions.

BibTeX

@inproceedings{zadrozny2025hummorph,
  author    = {Zadro{\.z}ny, Jakub and Bilen, Hakan},
  title     = {HumMorph: Generalized Dynamic Human Neural Fields from Few Views},
  booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference},
  year      = {2025},
  pages     = {348--357},
}

Acknowledgments

This work was supported by United Kingdom Research and Innovation (grant EP/S023208/1) through the UKRI Centre for Doctoral Training in Robotics and Autonomous Systems at the University of Edinburgh, School of Informatics. HB was supported by the EPSRC Visual AI grant EP/T028572/1.