Recent progress in text-to-3D generation has been driven by score distillation methods, which leverage pre-trained text-to-image (T2I) diffusion models by distilling through the diffusion model training objective. However, such an approach inevitably relies on random timesteps at each update, which increases the variance of the gradient and ultimately prolongs the optimization process. In this paper, we propose to enhance text-to-3D optimization by leveraging the T2I diffusion prior in the generative sampling process with a predetermined timestep schedule. To this end, we interpret text-to-3D optimization as a multi-view image-to-image translation problem and propose a solution that approximates the probability flow. Building on this optimization algorithm, we design DreamFlow, a practical three-stage coarse-to-fine text-to-3D optimization framework that enables fast generation of high-quality, high-resolution (i.e., 1024×1024) 3D content. For example, we demonstrate that DreamFlow is five times faster than the existing state-of-the-art text-to-3D method while producing more photorealistic 3D content.
We introduce approximate probability flow ODE (APFO), which approximates the probability flow to update the 3D model. Given an image rendered from the 3D model, we optimize the image by transporting it to the high-density region of the text-to-image diffusion model via a Schrödinger bridge. We model the score function of the current scene by additional fine-tuning, and compute the approximate probability flow from the difference between the denoiser outputs of the pretrained model and the fine-tuned model. As in a conventional diffusion sampler, we use a decreasing sequence of timesteps to guarantee convergence, and we amortize the optimization over the multi-view images of the 3D scene.
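To make the update rule concrete, the following is a minimal PyTorch-style sketch of the amortized APFO loop under our own simplifying assumptions: `render`, `pretrained_eps`, and `finetuned_eps` are hypothetical callables standing in for a differentiable renderer, the pretrained text-conditioned denoiser, and the scene-specific fine-tuned denoiser, and the noise schedule is condensed into per-step `sigma` values. It illustrates the difference-of-denoisers transport with a decreasing timestep schedule, not the exact implementation.

```python
import torch

def apfo_optimize(render, pretrained_eps, finetuned_eps, params, optimizer,
                  text_emb, timesteps, sigmas, cameras):
    """Sketch of the amortized APFO loop (hypothetical interfaces).

    timesteps: decreasing sequence of timesteps (e.g., 1.0 -> 0.2 in stage 1).
    sigmas:    noise scale associated with each timestep.
    cameras:   camera poses sampled for multi-view amortization.
    """
    for t, sigma in zip(timesteps, sigmas):
        for cam in cameras:                        # amortize over multi-view renders
            optimizer.zero_grad()
            x = render(params, cam)                # differentiable rendering
            x_t = x + sigma * torch.randn_like(x)  # perturb the render to noise level t
            with torch.no_grad():
                eps_text = pretrained_eps(x_t, t, text_emb)   # pretrained T2I prior
                eps_scene = finetuned_eps(x_t, t, text_emb)   # fine-tuned scene score
                delta = eps_text - eps_scene       # approximate probability flow direction
            # Transport the rendered image toward the high-density region of the
            # text-conditioned prior by backpropagating delta through the renderer.
            x.backward(gradient=delta)
            optimizer.step()
        # (In practice the scene denoiser is also periodically fine-tuned on the
        #  current renders; omitted here for brevity.)
```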
The proposed framework, DreamFlow, performs coarse-to-fine text-to-3D optimization for high-quality 3D content generation. We first optimize a NeRF (e.g., using a hash-grid encoder) with a latent diffusion model (e.g., Stable Diffusion v2.1) at a resolution of 256×256, with timesteps decreasing from 1.0 to 0.2. Then, we extract a 3D mesh from stage 1 for efficient 3D modeling and optimize it at a resolution of 512×512 using the same denoiser as in stage 1, with timesteps decreasing from 0.5 to 0.1. Lastly, at stage 3, we refine the 3D mesh using a diffusion refiner (e.g., the Stable Diffusion XL refiner) to generate a 3D mesh at a resolution of 1024×1024. This mesh refinement significantly enhances the photorealism of the resulting 3D model compared to prior methods.
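As a compact summary of the coarse-to-fine schedule above, here is an illustrative Python configuration; the `StageConfig` container and its field names are our own, and the timestep schedule of stage 3 is left unspecified since only the refiner and its resolution are stated above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StageConfig:
    representation: str        # 3D representation optimized in this stage
    prior: str                 # diffusion model used as the teacher
    resolution: int            # rendering resolution
    t_start: Optional[float]   # largest timestep of the decreasing schedule
    t_end: Optional[float]     # smallest timestep of the decreasing schedule

# Illustrative coarse-to-fine schedule mirroring the description above.
DREAMFLOW_STAGES = [
    StageConfig("NeRF (hash-grid encoder)", "Stable Diffusion v2.1",       256,  1.0, 0.2),
    StageConfig("extracted 3D mesh",        "Stable Diffusion v2.1",       512,  0.5, 0.1),
    StageConfig("3D mesh refinement",       "Stable Diffusion XL refiner", 1024, None, None),
]
```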
Example prompts for qualitative results: a sliced loaf of fresh bread; a corgi standing up drinking boba; an imperial state crown of England; a beautiful dress made out of garbage bags, on a mannequin; a 3D model of an adorable cottage with a thatched roof; a silver platter piled high with fruits; a tarantula, highly detailed; a tiger eating an ice cream cone; a tiger dressed as a doctor; a wedding dress made out of tentacles.