Abstract:
This paper, grounded in the technical concepts of deep conditional posteriors and semantic segmentation, and building on advances in flat image generation and differentiable rendering, proposes a diffusion-model-based method for generating three-dimensional objects in artistic styles, together with its technical route. It identifies several key technical problems and their solutions, including the geometric defects of artistic-style neural radiance fields, the suppression of floating artifacts, and key techniques for regularizing the geometric structure of the main object.
Keywords:
3D Generation; Artistic Style Modeling; Geometric Regularity; Neural Radiance Fields; Diffusion Models
0 Introduction
In the field of 3D model generation, many studies have explored various 3D representation forms, such as 3D voxel grids, point clouds, meshes, implicit representations, and octree representations. These methods generally rely on training data in the form of 3D assets, but acquiring large-scale 3D assets is difficult. Thanks to the successful application of neural radiance field (NeRF) technology, recent research has turned to 3D-aware image synthesis, which has the advantage of learning and generating 3D models directly from images. Moreover, through differentiable rendering, neural radiance fields can be converted into 3D asset forms suitable for industry.
On the other hand, text-to-image diffusion models have become the state of the art in image generation. Diffusion models simulate the physical diffusion phenomenon through a forward noising process and a learned reverse process, achieving excellent visual quality. With the breakthrough of text-to-image generation models, text-to-3D generation has begun to receive widespread attention in academia, and many 3D generation methods use the image distribution learned by diffusion models to guide the optimization of neural radiance fields. Existing methods for generating neural radiance fields under diffusion-model guidance fall mainly into two types: score distillation sampling (SDS) and variational score distillation (VSD). Score distillation sampling distills a pre-trained large-scale text-to-image diffusion model and shows great promise in text-to-3D generation, but it suffers from over-saturation, over-smoothing, and low diversity. Wang et al. proposed ProlificDreamer, which models the 3D parameters as random variables instead of constants as in SDS and introduces a particle-based variational framework, namely variational score distillation. ProlificDreamer can generate neural radiance fields with high rendering resolution, high fidelity, rich structure, and complex effects.
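The SDS update described above can be sketched in a few lines. The snippet below is a minimal, self-contained illustration in NumPy, not the actual ProlificDreamer or SDS implementation: the noise schedule and the `denoiser` stand-in are assumptions for demonstration, whereas a real system would call a pre-trained text-to-image diffusion model and back-propagate the gradient through the differentiable renderer into the NeRF parameters θ.

```python
import numpy as np

def sds_gradient(rendered, denoiser, t, weight, rng):
    """One score distillation sampling (SDS) step, sketched.

    rendered : rendered image from the current NeRF, shape (H, W, 3).
    denoiser : callable(noisy_image, t) -> predicted noise (a pre-trained
               diffusion model in practice; a stand-in here).
    t        : diffusion timestep in [0, 1], controlling the noise level.
    weight   : scalar weighting w(t).
    """
    eps = rng.standard_normal(rendered.shape)       # sample Gaussian noise
    alpha = np.cos(t * np.pi / 2) ** 2              # toy noise schedule (assumption)
    noisy = np.sqrt(alpha) * rendered + np.sqrt(1.0 - alpha) * eps
    eps_hat = denoiser(noisy, t)                    # score from the diffusion prior
    # SDS drops the denoiser's Jacobian: the gradient w.r.t. the rendering
    # is simply w(t) * (eps_hat - eps), pushed back through the renderer.
    return weight * (eps_hat - eps)

rng = np.random.default_rng(0)
img = rng.random((8, 8, 3))
dummy_denoiser = lambda x, t: np.zeros_like(x)      # stand-in for a real model
g = sds_gradient(img, dummy_denoiser, t=0.5, weight=1.0, rng=rng)
print(g.shape)  # (8, 8, 3)
```

VSD differs in that it treats the 3D scene as a distribution of particles and trains an auxiliary score network on the rendered images, replacing the injected noise term with that learned score.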
Existing diffusion-model-based neural radiance field generation methods are mostly built on realistic images. When modeling from artistic-style images, they struggle to produce correct geometry, yielding numerous floating artifacts and incorrect geometric structures. The reasons are as follows. Firstly, when diffusion-model-generated images are used to guide a neural radiance field, data consistency is difficult to ensure. Neural radiance fields rely on real-world photos captured from multiple viewpoints to learn the 3D structure and color of a scene; these photos contain complex lighting and reflection characteristics that are consistent across images. Images generated by a diffusion model, by contrast, may differ in lighting, color, and style from one picture to the next.
Secondly, artistic style images have unique textures and lighting. Artistic style images often have unique material and lighting models that may not follow the laws of real-world physics. For example, shadows, highlights, and reflections may be artistic and not necessarily consistent in a physically correct manner across images. When neural radiance fields attempt to reconstruct a 3D scene based on these inconsistent visual cues, it may result in unrealistic geometric shapes or cause floating artifacts.
Thirdly, there is a difference in the frequency content of guidance images. For instance, cartoon images often contain large areas of uniform color and sharp boundaries, rather than the detailed textures and gradients found in real-world images. Neural radiance fields typically rely on the details and textures in the images to infer the depth and geometric information of the scene. This high contrast and low-frequency content may make it difficult for neural radiance fields to correctly infer continuous geometric structures.
Fourthly, diffusion model-generated images may lack perspective diversity, and the generated images may not provide enough perspective variation for neural radiance fields to capture accurate depth information. For example, cartoon images are usually hand-drawn and may not have accurate perspective variations corresponding to the real world. This will further exacerbate the inaccuracies in the reconstruction process.
To generate three-dimensional models in an artistic style, either the neural radiance field must be modified so that it can capture accurate depth information and better adapt to artistic-style images, or 3D reconstruction techniques must be developed specifically for non-photorealistic images.
1 Theoretical Foundation
2 Geometry Regularization for Artistic-Style Three-Dimensional Objects
Typically, the update rules of neural radiance fields include geometry regularization loss functions. These loss functions use the geometric information of the neural radiance field (usually depth, density, etc.) to regularize its parameters θ, thereby correcting the geometry. A commonly used loss function for geometric regularization is:
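Since the specific formula does not survive in the source, the sketch below shows one common style of such a regularizer, not necessarily the paper's: an entropy term on the per-ray volume-rendering weights plus a depth-variance term. Both use only the geometric quantities named above (density-derived weights and sample depths), and both penalize the diffuse, multi-surface weight distributions that manifest as floaters. The function name and the λ coefficients are illustrative assumptions.

```python
import numpy as np

def geometry_regularization(weights, depths, lam_entropy=0.01, lam_var=0.1):
    """Illustrative NeRF geometry regularizer (an assumption, not the
    paper's exact loss, which is missing from the source).

    weights : (num_rays, num_samples) volume-rendering weights per ray.
    depths  : (num_samples,) sample depths along each ray.
    """
    eps = 1e-10
    w = weights / (weights.sum(axis=-1, keepdims=True) + eps)
    # Entropy term: push each ray's weight distribution toward a single
    # opaque surface, suppressing semi-transparent floaters.
    entropy = -(w * np.log(w + eps)).sum(axis=-1).mean()
    # Depth-variance term: concentrate weights around the expected depth,
    # encouraging a thin, regular surface for the main object.
    mean_d = (w * depths).sum(axis=-1, keepdims=True)
    var = (w * (depths - mean_d) ** 2).sum(axis=-1).mean()
    return lam_entropy * entropy + lam_var * var

weights = np.array([[0.0, 0.9, 0.1, 0.0],    # nearly a single surface
                    [0.25, 0.25, 0.25, 0.25]])  # diffuse, floater-like ray
depths = np.linspace(0.0, 3.0, 4)
loss = geometry_regularization(weights, depths)
print(loss > 0)  # True
```

A ray whose weights collapse onto one sample scores near zero under both terms, while a uniformly spread ray is penalized, which is the behavior a geometry regularizer needs here.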
(The specific loss function formula is missing from the source text.)
3 Application Case
Taking the illustration style as an example, the improved geometry-regularized variational score distillation method is applied to three-dimensional object generation. Specifically, runwayml/stable-diffusion-v1-5 serves as the baseline text-to-image diffusion model; it is fine-tuned with the DreamBooth method on about 10 images of the target object, with additional depth conditioning provided by lllyasviel/
Figure 2: The artistic-style three-dimensional object generation method proposed in this paper
Qualitative samples from the proposed artistic-style three-dimensional object generation method based on the text-to-image diffusion model are shown in Figure 3, with existing representative methods as the control group; all generations run for 10,000 iterations. In the first three rows of images, the even-numbered images are density maps sampled from the 3D model shown in the preceding image. Qualitatively, the artistic-style three-dimensional objects generated by this method are of higher quality. Specifically, they are closer to the diffusion model's guidance images in texture style and color; moreover, as the density maps show, their geometric structure is more consistent with the target image, and floating artifacts are almost completely suppressed.
4 Future Challenges
At present, the technology for generating 3D objects from text-to-image diffusion models is burgeoning. However, existing methods still fall short of industrial production standards in 3D generation quality, including high-resolution generation, inference speed, multi-view consistency, and geometric consistency. This paper proposes an artistic-style 3D object generation method based on text-to-image diffusion models, which shows superior geometric and texture quality in artistic-style 3D object generation compared with previous methods. However, owing to limits of time and resources, this work still has several shortcomings: ① because additional inference models are introduced, the proposed method has higher computational requirements and higher inference latency, even though an accelerated-convergence loss function was introduced; ② the method is still driven by the diffusion-model prior, so generation quality is hard to guarantee from the perspective of data consistency, and lighting and reflection characteristics still differ considerably between images; ③ the images generated by the diffusion model still lack viewpoint diversity and cannot provide enough perspective variation for the NeRF to capture sufficient geometric information.
Figure 3: Experimental results of this paper's method compared to existing representative methods.
Based on the advantages and shortcomings identified above, follow-up research can improve on this work in the following aspects. First, optimize semantic-segmentation-based floating artifact suppression to improve algorithmic efficiency. In 3D object generation, the segmentation task is simpler than in complex scenes, so the additional inference latency introduced by large models such as SAM may be reduced through task-specific fine-tuning and distillation. Second, address multi-view consistency in diffusion-model-guided 3D object generation. The guided method lacks information about the viewpoint of the guidance image during training; in addition, few-shot stylized fine-tuning lacks viewpoint diversity, so multi-view consistency falls short of traditional neural radiance fields. Follow-up work should address multi-view consistency for 3D generation methods guided by distilled diffusion models.
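The segmentation-based floater suppression discussed above can be illustrated with a small sketch. This is an assumption about the general shape of such a loss, not the paper's actual formulation: rendered opacity that falls outside the subject's segmentation mask is treated as a floating artifact and penalized. In practice the mask would come from a segmentation model such as SAM; here it is hand-built.

```python
import numpy as np

def mask_opacity_loss(opacity, subject_mask):
    """Sketch of segmentation-guided floater suppression (illustrative).

    opacity      : (H, W) accumulated opacity rendered from the NeRF.
    subject_mask : (H, W) binary mask of the main object (1 = subject),
                   e.g. produced by a segmentation model such as SAM.
    Any opacity rendered in the background region is penalized, since
    density there can only come from floating artifacts.
    """
    background = 1.0 - subject_mask
    return float((opacity * background).sum() / (background.sum() + 1e-10))

opacity = np.zeros((4, 4))
opacity[1:3, 1:3] = 1.0      # the subject occupies the center
opacity[0, 3] = 0.8          # a floater in the background
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0
print(round(mask_opacity_loss(opacity, mask), 4))  # 0.0667
```

Because the loss touches only background pixels, it can be minimized to zero without disturbing the subject's geometry, which is why distilling a fast task-specific segmenter, as suggested above, mainly trades off mask quality against the added inference latency.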
5 Conclusion
Artistic-style 3D object generation is an important vertical topic within 3D object generation, with applications in sketch-assisted design, non-photorealistic model construction, and other areas. Existing methods based on diffusion-model score distillation struggle to generate 3D models from artistic images. This paper, based on the technical ideas of deep conditional posteriors and semantic segmentation, and building on advances in flat image generation, differentiable rendering, and related fields, proposes an artistic-style 3D object generation method based on diffusion models together with its technical route. It lists the problems encountered when diffusion models guide artistic-style 3D generation and proposes key techniques built on floating artifact suppression and regularization of the subject's geometric structure. Finally, the paper surveys the key problems and technical challenges still facing artistic-style 3D generation, providing feasible directions for future research. (References omitted)
Xu Hao
Master's degree graduate student at Zhejiang University. Main research direction: digital content generation.
Gupeng Yun
Ph.D. from the Massachusetts Institute of Technology, Chief Scientist at Zhejiang Green Intelligent Technology Co., Ltd. Main research areas include computer-aided engineering and mechanical dynamics.
Excerpted from the "Communications of the Chinese Association for Artificial Intelligence"
Volume 14, Issue 4, 2024
Special Issue on Intelligent Creativity and Digital Art