DreamFlow: High-Quality Text-to-3D Generation by Approximating Probability Flow

KAIST, Google Research
ICLR 2024 (Spotlight)

Abstract

Recent progress in text-to-3D generation has been achieved through the use of score distillation methods: they leverage pre-trained text-to-image (T2I) diffusion models by distilling through the diffusion model training objective. However, such an approach inevitably results in the use of random timesteps at each update, which increases the variance of the gradient and ultimately prolongs the optimization process. In this paper, we propose to enhance text-to-3D optimization by leveraging the T2I diffusion prior in the generative sampling process with a predetermined timestep schedule. To this end, we interpret text-to-3D optimization as a multi-view image-to-image translation problem, and propose a solution by approximating the probability flow. Leveraging the proposed optimization algorithm, we design DreamFlow, a practical three-stage coarse-to-fine text-to-3D optimization framework that enables fast generation of high-quality, high-resolution (i.e., 1024×1024) 3D content. For example, we demonstrate that DreamFlow is 5 times faster than the existing state-of-the-art text-to-3D method, while producing more photorealistic 3D content.

Approximating Probability Flow ODE

Our method, approximate probability flow ODE (APFO), uses a predetermined timestep schedule, in contrast to score distillation sampling (SDS), and amortizes the optimization to update the multi-view images of the 3D model. We fine-tune the denoiser to accurately compute the probability flow.

We introduce approximate probability flow ODE (APFO), which approximates the probability flow to update the 3D model. Given an image rendered from the 3D model, we optimize the image by transporting it to the high-density region of the text-to-image diffusion model via a Schrödinger bridge. We model the score function of the current scene by additionally fine-tuning the denoiser, and compute the approximate probability flow as the difference between the denoiser outputs of the pre-trained and fine-tuned models. As in conventional diffusion samplers, we use a decreasing sequence of timesteps to guarantee convergence, and we amortize the optimization over the multi-view images of the 3D scene.
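To make the update concrete, below is a minimal PyTorch sketch of one APFO optimization loop. It is a sketch under assumptions: the scene object (with a differentiable render method and parameters), the two denoiser callables eps_pretrained and eps_finetuned, and the VE-style noise perturbation are illustrative placeholders, and the per-timestep weighting and the denoiser fine-tuning loop are omitted for brevity.

    import torch

    @torch.no_grad()
    def apfo_direction(x, t, sigma_t, eps_pretrained, eps_finetuned, prompt_emb):
        # Perturb the rendering to noise level t (VE-style for simplicity;
        # latent diffusion models such as Stable Diffusion use a VP schedule).
        noise = torch.randn_like(x)
        x_t = x + sigma_t * noise
        # Transport direction of the approximate probability flow: the difference
        # between the pre-trained denoiser (target distribution) and the
        # scene-fine-tuned denoiser (current scene distribution).
        return eps_pretrained(x_t, t, prompt_emb) - eps_finetuned(x_t, t, prompt_emb)

    def optimize_scene(scene, cameras, timesteps, sigmas,
                       eps_pretrained, eps_finetuned, prompt_emb, lr=1e-2):
        opt = torch.optim.Adam(scene.parameters(), lr=lr)
        # Decreasing timestep schedule, as in a conventional diffusion sampler.
        for t, sigma_t in zip(timesteps, sigmas):
            # Amortize the update over multi-view renderings of the scene.
            for cam in cameras:
                x = scene.render(cam)  # differentiable rendering
                g = apfo_direction(x, t, sigma_t, eps_pretrained,
                                   eps_finetuned, prompt_emb)
                opt.zero_grad()
                x.backward(gradient=g)  # chain the flow direction through the renderer
                opt.step()
            # The fine-tuned denoiser is periodically updated on current
            # renderings to track the scene distribution (omitted here).

Because the timesteps decrease monotonically rather than being sampled at random as in SDS, each update follows a sampler-like trajectory, which is what reduces gradient variance and shortens the optimization.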

Coarse-to-fine text-to-3D optimization of DreamFlow

Our text-to-3D generation proceeds in a coarse-to-fine manner: we first optimize a NeRF, then extract a 3D mesh and fine-tune it. We use the same latent diffusion model (denoiser 1) for the first and second stages. Lastly, we refine the 3D mesh with a high-resolution latent diffusion prior (denoiser 2). At each stage, we optimize with a different timestep schedule, which effectively utilizes the diffusion priors.

The proposed framework, DreamFlow, performs coarse-to-fine text-to-3D optimization for high-quality 3D content generation. We first optimize a NeRF (e.g., using a hash-grid encoder) with a latent diffusion model (e.g., Stable Diffusion v2.1) at a resolution of 256×256, with timesteps decreasing from 1.0 to 0.2. Then, we extract a 3D mesh from stage 1 for efficient 3D modeling, and optimize the mesh at a resolution of 512×512 using the same denoiser as in stage 1, with timesteps decreasing from 0.5 to 0.1. Lastly, in stage 3, we refine the 3D mesh using a diffusion refiner (e.g., the Stable Diffusion XL refiner) to produce a mesh at a resolution of 1024×1024. This mesh refinement significantly enhances the photorealism of the 3D model compared to prior methods; the three stages are summarized in the sketch below.
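For reference, the three-stage schedule can be written down as plain configuration data. This is a sketch under assumptions: the field names and step counts are hypothetical, and the stage-3 timestep range is not stated above and is a placeholder, while the resolutions, denoisers, and stage-1/2 timestep ranges follow the text.

    from dataclasses import dataclass

    @dataclass
    class StageConfig:
        representation: str  # 3D representation optimized in this stage
        denoiser: str        # diffusion prior used as the teacher
        resolution: int      # rendering resolution
        t_start: float       # initial (largest) timestep
        t_end: float         # final (smallest) timestep
        num_steps: int       # number of schedule steps (hypothetical)

    STAGES = [
        StageConfig("NeRF (hash-grid)", "Stable Diffusion v2.1", 256, 1.0, 0.2, 200),
        StageConfig("extracted mesh", "Stable Diffusion v2.1", 512, 0.5, 0.1, 100),
        # Stage-3 timestep range is a placeholder; only the refiner and
        # resolution are stated in the text above.
        StageConfig("refined mesh", "Stable Diffusion XL refiner", 1024, 0.3, 0.1, 50),
    ]

    def timestep_schedule(cfg: StageConfig):
        # Linearly decreasing timesteps, as in a conventional diffusion sampler.
        step = (cfg.t_start - cfg.t_end) / max(cfg.num_steps - 1, 1)
        return [cfg.t_start - i * step for i in range(cfg.num_steps)]

Starting each later stage at a smaller maximum timestep reflects that the scene is already close to the target distribution, so only lower-noise refinement steps are needed.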

Results

A sliced loaf of fresh bread.

A corgi standing up drinking boba.

An imperial state crown of England.

A beautiful dress made out of garbage bags, on a mannequin.

A 3D model of adorable cottage with a thatched roof.

A silver platter piled high with fruits.

A tarantula, highly detailed.

A tiger eating an ice cream cone.

A tiger dressed as a doctor.

A wedding dress made out of tentacles.