High-level 3D scene understanding is essential in many applications. However, the difficulty of generating accurate 3D annotations hampers the development of deep learning models. We turn to recent advancements in the automatic retrieval of synthetic CAD models, and show that the data generated by such methods can serve as high-quality ground truth for training supervised deep learning models. More precisely, we employ a pipeline akin to the one previously used to automatically annotate objects in ScanNet scenes with their 9D poses and CAD models. This time, we apply it to the recent ScanNet++ v1 dataset, which previously lacked such annotations. Our findings demonstrate not only that it is possible to train deep learning models on these automatically obtained annotations, but also that the resulting models outperform those trained on manually annotated data. We validate this on two distinct tasks: point cloud completion and single-view CAD model retrieval and alignment. Our results underscore the potential of automatic 3D annotations both to enhance model performance and to significantly reduce annotation costs. To support future research in 3D scene understanding, we release our annotations, which we call SCANnotate++, along with our trained models.
Our pipeline begins with an RGB-D scan and the corresponding 3D instance segmentation labels. We extract a bounding box for each segmented object and use it to initialize a coarse object pose.
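As a minimal sketch, this initialization step can be written as follows, assuming the scan is given as an (N, 3) point array with per-point instance IDs; the function and variable names are illustrative, not the pipeline's actual API:

import numpy as np

def init_coarse_pose(points, instance_ids, obj_id):
    """Return a coarse 9-DoF pose (translation, rotation, scale) for one object."""
    obj_pts = points[instance_ids == obj_id]           # points of this instance
    lo, hi = obj_pts.min(axis=0), obj_pts.max(axis=0)  # axis-aligned bounding box
    translation = 0.5 * (lo + hi)                      # box center
    scale = hi - lo                                    # per-axis extent
    rotation = np.eye(3)                               # identity; refined later by the search
    return translation, rotation, scale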
We then perform CAD model retrieval and pose estimation using HOC-Search, an algorithm that conducts joint discrete and continuous optimization over shape and pose parameters. It searches a tree-structured index of CAD models (the HOC-Tree), built from ShapeNet via shape-similarity clustering, and iteratively refines the pose using a differentiable render-and-compare loss.
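The discrete part of the search can be pictured as descending a tree of shape clusters. The sketch below is a deliberately simplified, greedy stand-in: the actual HOC-Search strategy is more sophisticated and interleaves the descent with continuous pose refinement, and fitness here stands in for evaluating the render-and-compare objective on a candidate:

def tree_search(root, fitness):
    """Greedily descend a cluster tree, following the best-scoring child."""
    node = root
    while node.children:                        # stop once a leaf CAD model is reached
        node = min(node.children, key=fitness)  # lower objective value = better fit
    return node.cad_model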
The objective function used during optimization combines depth similarity, silhouette overlap, and the Chamfer distance between the observed data and the rendered CAD model. Gradient-based refinement improves the pose alignment after each search iteration.
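A hedged sketch of this combined objective in PyTorch is shown below; the weights and the exact form of each term are illustrative assumptions rather than the precise formulation used during optimization:

import torch

def chamfer(a, b):
    """Symmetric Chamfer distance between (N, 3) and (M, 3) point sets."""
    d = torch.cdist(a, b)                                    # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def render_and_compare_loss(depth_obs, depth_ren, mask_obs, mask_ren,
                            pts_obs, pts_ren, w_d=1.0, w_s=1.0, w_c=1.0):
    valid = mask_obs & mask_ren                              # pixels observed in both
    depth_term = (depth_obs - depth_ren).abs()[valid].mean() # depth similarity
    inter = (mask_obs & mask_ren).float().sum()
    union = (mask_obs | mask_ren).float().sum().clamp(min=1.0)
    silhouette_term = 1.0 - inter / union                    # one minus silhouette IoU
    return w_d * depth_term + w_s * silhouette_term + w_c * chamfer(pts_obs, pts_ren)

Because each term is differentiable with respect to the rendered quantities, the pose parameters can be refined by gradient descent through a differentiable renderer.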
To increase annotation consistency, we include a clustering and cloning stage where objects of the same class and similar shape within a scene are assigned the same CAD model. Final poses for cloned instances are then re-optimized.
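The sketch below illustrates this cloning logic, reusing the chamfer helper from the loss sketch above; the shape-similarity threshold tau and the per-object fields (label, points, cad_model, fit_loss) are assumptions for illustration:

def clone_cad_models(objects, tau=0.05):
    """Within each class, share one CAD model among similar-shaped objects."""
    by_class = {}
    for obj in objects:
        by_class.setdefault(obj.label, []).append(obj)
    for group in by_class.values():
        group.sort(key=lambda o: o.fit_loss)   # best-fitting objects act as seeds first
        assigned = set()
        for seed in group:
            if id(seed) in assigned:
                continue
            for other in group:
                if id(other) not in assigned and chamfer(seed.points, other.points) < tau:
                    other.cad_model = seed.cad_model  # clone the seed's CAD model
                    assigned.add(id(other))           # its pose is re-optimized afterwards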
Finally, all annotations undergo manual quality verification. Erroneous alignments are corrected, and pose refinement is re-run, ensuring the high fidelity of the final SCANnotate++ dataset.
We evaluate SCANnotate++ annotations on two tasks: supervised point cloud completion and CAD model retrieval and alignment. Models trained on our annotations outperform those trained on manual annotations from Scan2CAD, validating the quality of our dataset. The training and testing code will be made publicly available shortly.
Given a partial point cloud of an object from ScanNet or ScanNet++, we want to reconstruct the missing 3D points. We use the automatic annotations from SCANnotate and SCANnotate++ to generate complete point clouds as ground truth for supervised learning of point cloud completion, and present our visual results in the figure above. The top row shows the partial point cloud input, the middle row the reconstructed point cloud, and the bottom row the ground-truth point cloud.
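A minimal sketch of how such complete ground-truth clouds can be generated from the annotations, assuming each annotation stores a ShapeNet mesh path and a 9-DoF pose (the field layout is an assumption, not the released data format):

import numpy as np
import trimesh

def complete_cloud_from_annotation(mesh_path, rotation, translation, scale, n=16384):
    """Sample a dense, complete point cloud from the posed CAD model."""
    mesh = trimesh.load(mesh_path, force='mesh')
    pts, _ = trimesh.sample.sample_surface(mesh, n)  # (n, 3) points on the mesh surface
    pts = pts * scale                                # apply anisotropic scale first
    pts = pts @ rotation.T + translation             # then rotation and translation
    return pts.astype(np.float32)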
Given a single image, we want to retrieve and align 3D CAD models for the objects of the target categories. ROCA is a framework that, from a single image, retrieves CAD models from a candidate pool and aligns them to the detected objects. We train ROCA using both SCANnotate and SCANnotate++, and present the visual results in the figure above.
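The retrieval step in ROCA-style methods can be pictured as a nearest-neighbor lookup in a joint embedding space. The sketch below is schematic; the tensor shapes and the cosine-similarity choice are illustrative assumptions, not ROCA's exact formulation:

import torch
import torch.nn.functional as F

def retrieve_cad(roi_feat, cad_feats):
    """Return the index of the CAD embedding that best matches the ROI feature."""
    roi = F.normalize(roi_feat, dim=0)    # (D,) image region embedding
    cad = F.normalize(cad_feats, dim=1)   # (K, D) candidate CAD embeddings
    return int(torch.argmax(cad @ roi))   # cosine similarity, highest wins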
@article{rao2025leveraging,
  author  = {Rao, Yuchen and Ainetter, Stefan and Stekovic, Sinisa and Lepetit, Vincent and Fraundorfer, Friedrich},
  title   = {Leveraging Automatic CAD Annotations for Supervised Learning in 3D Scene Understanding},
  journal = {arXiv preprint arXiv:2504.13580},
  year    = {2025}
}