We address the challenge of compositional scene reconstruction with objects re-observed across space and time. We characterize this challenge through three concrete cases: spatial repetition, temporal repetition, and articulation dynamics. We propose the Joint Reconstruction Model (JRM) to perform coupled reconstruction of a group of objects, outperforming reconstruction of each object individually.
Object-centric reconstruction seeks to recover the 3D structure of a scene by composing independently reconstructed objects. While this independence can simplify modeling, it discards strong signals that could improve reconstruction, notably repetition, where the same object is observed multiple times within a scene or across scans.
We propose the Joint Reconstruction Model (JRM) to leverage repetition by framing object reconstruction as a problem of personalized generation: multiple observations share a common subject that should be consistent across all observations, while still adhering to the specific pose and state of each.
Prior methods in this direction rely on explicit matching and rigid alignment across observations, making them sensitive to errors and difficult to extend to non-rigid transformations. In contrast, JRM is a 3D flow-matching generative model that implicitly aggregates unaligned observations in its latent space, learning to produce consistent and faithful reconstructions in a data-driven manner without explicit constraints.
Evaluations on synthetic and real-world data show that JRM's implicit aggregation removes the need for explicit alignment, improves robustness to incorrect associations, and naturally handles non-rigid changes such as articulation. Overall, JRM outperforms both independent and alignment-based baselines in reconstruction quality.
Comparison between different approaches to object-centric reconstruction. JRM offers a relaxation of explicit alignment and registration techniques. Objects are jointly reconstructed, allowing information flow between them, but without imposing hard constraints on similarity.
Overview of JRM. Given multiple observations of object instances, JRM implicitly aggregates them in its latent space using a 3D flow-matching generative model. Without requiring explicit alignment or matching, the model learns to produce consistent and faithful 3D reconstructions that respect the specific pose and state of each observation.
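The core recipe described above combines two standard ingredients: flow matching on a 3D latent, and conditioning on observation latents that are pooled without any explicit alignment. The sketch below illustrates only that recipe, not JRM itself; the `aggregate_latents` pooling, latent dimension, and the linear interpolation path are all stand-in assumptions, since the paper's actual architecture is not specified here.

```python
import numpy as np

rng = np.random.default_rng(0)

def aggregate_latents(obs_latents):
    """Hypothetical implicit aggregation: pool unaligned observation
    latents into one shared conditioning code. JRM learns this inside
    the model; mean pooling is only a stand-in."""
    return obs_latents.mean(axis=0)

def flow_matching_target(x0, x1, t):
    """Linear-interpolant flow matching: return the point x_t on the
    noise-to-data path and the velocity the model should regress."""
    xt = (1.0 - t) * x0 + t * x1
    v = x1 - x0
    return xt, v

# Toy "latents": two unaligned observations of the same object instance.
obs_latents = rng.normal(size=(2, 16))
cond = aggregate_latents(obs_latents)

x1 = rng.normal(size=16)   # target 3D reconstruction latent
x0 = rng.normal(size=16)   # Gaussian noise sample
t = rng.uniform()
xt, v = flow_matching_target(x0, x1, t)

# A real model would predict v from (xt, t, cond) and minimize the
# regression loss; here we only check the training-target construction.
assert xt.shape == v.shape == cond.shape
```

Because the conditioning code is pooled from all observations, gradients flow between instances during training, which is one way the "information flow without hard constraints" described above can be realized.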
Reconstruction results on real scenes from Replica and ScanNet++. JRM achieves the best overall performance despite training only on pairs of synthetic objects.
| Methods | Replica: CD↓ | Replica: NC↑ | Replica: F1↑ | ScanNet++: CD↓ | ScanNet++: NC↑ | ScanNet++: F1↑ |
|---|---|---|---|---|---|---|
| DP-Recon | 4.65 | 74.87 | 71.95 | 5.53 | 72.47 | 65.98 |
| FM [1] | 3.74 | 79.28 | 79.21 | 4.20 | 78.60 | 72.96 |
| JRM (Ours) | 3.21 | 77.88 | 81.78 | 2.69 | 79.41 | 85.53 |
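The table reports Chamfer distance (CD, lower is better), normal consistency (NC), and F1 (higher is better). As a reference for how CD and F1 relate between two point sets, here is a minimal brute-force sketch; the distance threshold `tau` is an illustrative assumption, not the paper's evaluation setting.

```python
import numpy as np

def nn_dists(a, b):
    """For each point in a, distance to its nearest neighbor in b."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1)

def chamfer_f1(pred, gt, tau=0.1):
    """Symmetric Chamfer distance and threshold-based F1.
    tau is a hypothetical inlier threshold for precision/recall."""
    d_pg = nn_dists(pred, gt)   # prediction -> ground truth
    d_gp = nn_dists(gt, pred)   # ground truth -> prediction
    cd = d_pg.mean() + d_gp.mean()
    precision = (d_pg < tau).mean()
    recall = (d_gp < tau).mean()
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    return cd, f1

pred = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
gt   = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
cd, f1 = chamfer_f1(pred, gt)
# identical clouds: cd == 0.0, f1 == 1.0
```

A lower CD means the two surfaces are mutually close on average, while F1 penalizes both missing geometry (low recall) and hallucinated geometry (low precision).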
@inproceedings{wu2026jrm,
title={JRM: Joint Reconstruction Model for Multiple Objects without Alignment},
author={Wu, Qirui and Siddiqui, Yawar and Frost, Duncan and Aroudj, Samir and Avetisyan, Armen and Newcombe, Richard and Chang, Angel X. and Engel, Jakob and Howard-Jenkins, Henry},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2026}
}