DiffInDScene: Diffusion-based High-Quality 3D Indoor Scene Generation

1MMLab, The Chinese University of Hong Kong; 2Zhejiang University; 3Shanghai AI Laboratory

*Indicates Equal Contribution


Demonstration of our whole pipeline and the generation results.


Coarse-to-fine indoor scene geometry generation using a sparse diffusion framework. For better visualization, the texture is produced by DreamSpace after the scene geometry is generated by our DiffInDScene.

Indoor tour of generated scenes after texturing.

Another example of indoor tour.


We present DiffInDScene, a novel framework for high-quality 3D indoor scene generation, a problem made challenging by the complexity and diversity of indoor scene geometry. Although diffusion-based generative models have demonstrated impressive performance in image generation and object-level 3D generation, they have not yet been applied to room-level 3D generation because of their high computational cost. In DiffInDScene, we propose a cascaded 3D diffusion pipeline that is efficient and has strong generative performance for Truncated Signed Distance Function (TSDF) volumes. The whole pipeline runs on a sparse occupancy space in a coarse-to-fine fashion. Inspired by KinectFusion's incremental alignment and fusion of local TSDF volumes, we propose a diffusion-based SDF fusion approach that iteratively diffuses and fuses local TSDF volumes, facilitating the generation of an entire room environment. The generated results demonstrate that our method can achieve high-quality room generation directly in three-dimensional space, starting from scratch. Beyond scene generation, the final stage of DiffInDScene can be used as a post-processing module to refine 3D reconstruction results from multi-view stereo. According to our user study, the quality of the meshes generated by DiffInDScene can even exceed that of the ground-truth meshes provided by ScanNet.



As shown in (a), we employ a cascaded diffusion model to generate the whole room in a coarse-to-fine manner. The first stage generates the coarse structure of the whole room. The following stages refine this rough shape into a 3D occupancy field at progressively higher resolutions. At the final stage, the resolution reaches its highest level, and we crop the whole scene into overlapping pieces to generate the final detailed Truncated Signed Distance Function (TSDF) volume. Every stage uses a separate sparse diffusion model, which denoises exclusively on the sparsely distributed occupied voxels and thereby reduces resource consumption.
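The cropping step at the final stage can be illustrated with a small NumPy sketch. It is a simplified stand-in, not the method of the paper: overlapping cubes are processed independently and overlaps are merged by plain averaging, whereas DiffInDScene fuses local TSDF volumes inside the diffusion process. The names `crop_starts`, `fuse_overlapping_patches`, and `refine`, as well as the patch and stride sizes, are hypothetical.

```python
import numpy as np

def crop_starts(size, patch, stride):
    """Start indices of patches of length `patch` that cover [0, size)."""
    if stride > patch:
        raise ValueError("stride must not exceed patch, or gaps would appear")
    if size <= patch:
        return [0]
    starts = list(range(0, size - patch + 1, stride))
    if starts[-1] != size - patch:
        starts.append(size - patch)  # keep the last patch flush with the border
    return starts

def fuse_overlapping_patches(volume, patch=8, stride=4, refine=lambda p: p):
    """Crop `volume` into overlapping cubes, apply `refine`, average overlaps.

    `refine` is a placeholder for a per-patch model (assumption); plain
    averaging replaces the diffusion-based fusion described in the text.
    """
    acc = np.zeros(volume.shape, dtype=np.float64)
    weight = np.zeros(volume.shape, dtype=np.float64)
    for x in crop_starts(volume.shape[0], patch, stride):
        for y in crop_starts(volume.shape[1], patch, stride):
            for z in crop_starts(volume.shape[2], patch, stride):
                sl = (slice(x, x + patch),
                      slice(y, y + patch),
                      slice(z, z + patch))
                acc[sl] += refine(volume[sl])
                weight[sl] += 1.0
    # Every voxel is covered by at least one patch, so weight is never zero.
    return acc / weight
```

With the identity `refine`, the fused volume equals the input, which is a quick check that the crops cover the scene without gaps.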

To obtain hierarchical occupancy embeddings for the latent diffusion in (a), we design a multi-scale Patch-VQGAN, shown in (b), in which lower-resolution latents are decoded into higher-resolution latents together with a sparse occupancy. This latent representation lets the diffusion model prune unoccupied voxels as the resolution increases.
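The idea of pruning occupancy while moving to a higher resolution can be sketched as follows. This is an illustrative assumption about the data flow, not the Patch-VQGAN decoder itself: each occupied coarse voxel spawns its eight children at twice the resolution, and a per-child score (here a hypothetical array standing in for a predicted occupancy probability) decides which children survive.

```python
import numpy as np

def upsample_sparse(coords):
    """Map (N, 3) integer coarse voxel coordinates to their (8N, 3) children."""
    offsets = np.array([[i, j, k]
                        for i in (0, 1) for j in (0, 1) for k in (0, 1)])
    children = coords[:, None, :] * 2 + offsets[None, :, :]
    return children.reshape(-1, 3)

def prune(coords, scores, threshold=0.5):
    """Keep only voxels whose score exceeds `threshold` (assumed cutoff)."""
    return coords[np.asarray(scores) > threshold]
```

Because only the children of occupied voxels are ever instantiated, the number of active voxels grows with the occupied surface rather than with the full dense grid.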

Unconditional Generation Result


Sketch-conditioned Generation Result


The first row shows the binary sketch images used as the conditioning input, and the second row shows the corresponding generation results. Each sketch is produced by slicing the 3D occupancy near its middle along the up axis; the cutting height is randomly sampled during training. In this figure, the black lines are the bird's-eye-view projection of the occupied voxels at that height.
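A minimal sketch of how such a binary image could be extracted from an occupancy grid is given below. The function name `occupancy_to_sketch`, the choice of the last axis as the up axis, and the sampling range (the middle half of the height) are assumptions for illustration; the source only states that the cut is taken through the middle with a randomly sampled height.

```python
import numpy as np

def occupancy_to_sketch(occ, height=None, up_axis=2, rng=None):
    """Slice a boolean occupancy grid at one height along the up axis.

    If `height` is None, it is sampled from the middle half of the grid
    (an assumed range, used here only to mimic random cutting heights).
    """
    occ = np.asarray(occ, dtype=bool)
    n = occ.shape[up_axis]
    if height is None:
        rng = np.random.default_rng() if rng is None else rng
        low = n // 4
        high = max(low + 1, (3 * n) // 4)
        height = int(rng.integers(low, high))
    # The result is a 2D binary image seen from above.
    return np.take(occ, height, axis=up_axis)
```

For a hollow box whose four walls are occupied, any slice at a mid height yields a closed rectangular outline, which matches the floor-plan-like look of the sketches.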

Extension as a Reconstruction Refiner


Pipeline: using the third stage of our model to refine the reconstruction results of multi-view stereo (MVS). Please refer to Section 3.4 and Section 4.3 of our paper for details.


This figure shows the MVS refinement results. The meshes in the first row are colored by curvature, with green denoting lower values.


        title={DiffInDScene: Diffusion-based High-Quality 3D Indoor Scene Generation},
        author={Ju, Xiaoliang and Huang, Zhaoyang and Li, Yijin and Zhang, Guofeng and Qiao, Yu and Li, Hongsheng},
        booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},