We address the challenge of compositional scene reconstruction with objects re-observed across space and time. We characterize this challenge through three concrete cases: spatial repetition, temporal repetition, and articulation dynamics. We propose the Joint Reconstruction Model (JRM) to perform coupled reconstruction of a group of objects, outperforming reconstruction of each object individually.
Object-centric reconstruction seeks to recover the 3D structure of a scene by composing independently reconstructed objects. While this independence can simplify modeling, it discards strong signals that could improve reconstruction, notably repetition, where the same object is observed multiple times within a scene or across scans.
We propose the Joint Reconstruction Model (JRM) to leverage repetition by framing object reconstruction as a problem of personalized generation: multiple observations share a common subject that should be consistent across all observations, while still adhering to the specific pose and state of each.
Prior methods in this direction rely on explicit matching and rigid alignment across observations, making them sensitive to errors and difficult to extend to non-rigid transformations. In contrast, JRM is a 3D flow-matching generative model that implicitly aggregates unaligned observations in its latent space, learning to produce consistent and faithful reconstructions in a data-driven manner without explicit constraints.
Evaluations on synthetic and real-world data show that JRM's implicit aggregation removes the need for explicit alignment, improves robustness to incorrect associations, and naturally handles non-rigid changes such as articulation. Overall, JRM outperforms both independent and alignment-based baselines in reconstruction quality.
Comparison between different approaches to object-centric reconstruction. JRM offers a relaxation of explicit alignment and registration techniques. Objects are jointly reconstructed, allowing information flow between them, but without imposing hard constraints on similarity.
Overview of JRM. Given multiple observations of object instances, JRM implicitly aggregates them in its latent space using a 3D flow-matching generative model. Without requiring explicit alignment or matching, the model learns to produce consistent and faithful 3D reconstructions that respect the specific pose and state of each observation.
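The core recipe described above combines two standard ingredients: flow matching on a 3D latent, and conditioning on observation latents that are pooled without any explicit alignment. The sketch below illustrates only that recipe, not JRM itself; the `aggregate_latents` pooling, latent dimension, and the linear interpolation path are all stand-in assumptions, since the paper's actual architecture is not specified here.

```python
import numpy as np

rng = np.random.default_rng(0)

def aggregate_latents(obs_latents):
    """Hypothetical implicit aggregation: pool unaligned observation
    latents into one shared conditioning code. JRM learns this inside
    the model; mean pooling is only a stand-in."""
    return obs_latents.mean(axis=0)

def flow_matching_target(x0, x1, t):
    """Linear-interpolant flow matching: return the point x_t on the
    noise-to-data path and the velocity the model should regress."""
    xt = (1.0 - t) * x0 + t * x1
    v = x1 - x0
    return xt, v

# Toy "latents": two unaligned observations of the same object instance.
obs_latents = rng.normal(size=(2, 16))
cond = aggregate_latents(obs_latents)

x1 = rng.normal(size=16)   # target 3D reconstruction latent
x0 = rng.normal(size=16)   # Gaussian noise sample
t = rng.uniform()
xt, v = flow_matching_target(x0, x1, t)

# A real model would predict v from (xt, t, cond) and minimize the
# regression loss; here we only check the training-target construction.
assert xt.shape == v.shape == cond.shape
```

Because the conditioning code is pooled from all observations, gradients flow between instances during training, which is one way the "information flow without hard constraints" described above can be realized.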
Reconstruction results on real scenes from Replica and ScanNet++. JRM achieves the best overall performance despite training only on pairs of synthetic objects.
| Methods | Replica: CD↓ | Replica: NC↑ | Replica: F1↑ | ScanNet++: CD↓ | ScanNet++: NC↑ | ScanNet++: F1↑ |
|---|---|---|---|---|---|---|
| DP-Recon | 4.65 | 74.87 | 71.95 | 5.53 | 72.47 | 65.98 |
| FM [1] | 3.74 | 79.28 | 79.21 | 4.20 | 78.60 | 72.96 |
| JRM (Ours) | 3.21 | 77.88 | 81.78 | 2.69 | 79.41 | 85.53 |
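The table reports Chamfer distance (CD, lower is better), normal consistency (NC), and F1 (higher is better). As a reference for how CD and F1 relate between two point sets, here is a minimal brute-force sketch; the distance threshold `tau` is an illustrative assumption, not the paper's evaluation setting.

```python
import numpy as np

def nn_dists(a, b):
    """For each point in a, distance to its nearest neighbor in b."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1)

def chamfer_f1(pred, gt, tau=0.1):
    """Symmetric Chamfer distance and threshold-based F1.
    tau is a hypothetical inlier threshold for precision/recall."""
    d_pg = nn_dists(pred, gt)   # prediction -> ground truth
    d_gp = nn_dists(gt, pred)   # ground truth -> prediction
    cd = d_pg.mean() + d_gp.mean()
    precision = (d_pg < tau).mean()
    recall = (d_gp < tau).mean()
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    return cd, f1

pred = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
gt   = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
cd, f1 = chamfer_f1(pred, gt)
# identical clouds: cd == 0.0, f1 == 1.0
```

A lower CD means the two surfaces are mutually close on average, while F1 penalizes both missing geometry (low recall) and hallucinated geometry (low precision).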
@inproceedings{wu2026jrm,
title={JRM: Joint Reconstruction Model for Multiple Objects without Alignment},
author={Wu, Qirui and Siddiqui, Yawar and Frost, Duncan and Aroudj, Samir and Avetisyan, Armen and Newcombe, Richard and Chang, Angel X. and Engel, Jakob and Howard-Jenkins, Henry},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2026}
}