Proposed Method

(a) Synthetic data generation
Instead of using a human-annotated preference dataset, we generate a group of images for each prompt and compute reward signals from multiple reward models, such as the image aesthetic score (VILA), the image-text alignment score (VQAScore), and the human preference score (MPS). We find that scaling both the number of prompts and the number of images per prompt is crucial for performance gains: we use 100K prompts from DiffusionDB and generate 16 images per prompt.
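A minimal sketch of this generation loop is given below, assuming a diffusers-style text-to-image pipeline `pipe` and three scoring callables `vila_score`, `vqa_score`, and `mps_score` that wrap the respective reward models; these names and interfaces are illustrative, not the actual implementation.

```python
import torch

@torch.no_grad()
def generate_group(pipe, prompt, group_size=16, seed=0):
    """Sample a group of images for one prompt from the pretrained model."""
    generator = torch.Generator(device="cuda").manual_seed(seed)
    return pipe(prompt, num_images_per_prompt=group_size, generator=generator).images

@torch.no_grad()
def score_group(images, prompt, vila_score, vqa_score, mps_score):
    """Compute one reward vector per image: (aesthetic, text alignment, preference)."""
    return [
        (vila_score(img), vqa_score(img, prompt), mps_score(img, prompt))
        for img in images
    ]

def build_dataset(pipe, prompts, reward_models, group_size=16):
    """Loop over prompts (e.g., 100K from DiffusionDB) and store each group with its rewards."""
    dataset = []
    for i, prompt in enumerate(prompts):
        images = generate_group(pipe, prompt, group_size, seed=i)
        rewards = score_group(images, prompt, *reward_models)
        dataset.append({"prompt": prompt, "images": images, "rewards": rewards})
    return dataset
```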
(b) Reward Calibration
Instead of using the reward model's score directly, we calibrate it to approximate the general preference by computing the expected win-rate against samples generated by the pretrained model. Specifically, we approximate the win-rate with the Bradley-Terry model applied to the statistics of the group. This calibration has two advantages: (1) it bounds the scores to the range [0,1], which lowers the variance across reward scores and leads to more balanced optimization; (2) the calibrated reward acts as a global metric, in that a higher calibrated reward indicates better performance across different prompts, whereas a higher raw reward score does not necessarily mean a better sample.
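The following sketch shows one way to compute this calibration within a group, assuming the calibrated reward of a sample is its average Bradley-Terry win probability against the other samples of the same group; the exact formula used in the paper may differ in detail.

```python
import numpy as np

def calibrate_rewards(raw_rewards):
    """
    Map raw reward scores of a group to expected win-rates in [0, 1].

    Under the Bradley-Terry model, the probability that sample i beats sample j
    is sigmoid(r_i - r_j). The calibrated reward of sample i is its expected
    win-rate against the other samples generated by the pretrained model.
    """
    r = np.asarray(raw_rewards, dtype=np.float64)   # shape (N,)
    diff = r[:, None] - r[None, :]                  # pairwise differences r_i - r_j
    win_prob = 1.0 / (1.0 + np.exp(-diff))          # Bradley-Terry win probabilities
    np.fill_diagonal(win_prob, 0.0)                 # exclude self-comparisons
    return win_prob.sum(axis=1) / (len(r) - 1)      # expected win-rate in [0, 1]

# Example: four images of one prompt scored by a single reward model.
print(calibrate_rewards([2.1, 0.3, 1.7, -0.5]))     # higher raw score -> higher win-rate
```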
(c) Pair Selection
For the single-reward case, it is straightforward to select the pairs with the largest difference in calibrated rewards. For the multi-reward case, however, we need to select pairs that are Pareto optimal in the multi-dimensional reward space. We propose a frontier-based pair selection method that effectively handles the multi-preference distribution by selecting pairs from the Pareto frontiers. Specifically, we compute the Pareto frontiers of the calibrated rewards and select positives (i.e., the non-dominated set) from the upper frontier and negatives (i.e., the dominated set) from the lower frontier. After removing potential duplicates, we randomly choose pairs from the selected positives and negatives at each training step.
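A sketch of this selection step is given below, under the assumption that the "lower frontier" is the non-dominated set of the negated rewards (i.e., the most dominated samples); the function and variable names are illustrative.

```python
import numpy as np

def pareto_frontier(points):
    """Return indices of non-dominated points in an (N, K) array of calibrated rewards.

    A point is dominated if another point is >= in every reward dimension
    and strictly > in at least one.
    """
    n = len(points)
    non_dominated = np.ones(n, dtype=bool)
    for i in range(n):
        for j in range(n):
            if i != j and np.all(points[j] >= points[i]) and np.any(points[j] > points[i]):
                non_dominated[i] = False
                break
    return np.where(non_dominated)[0]

def select_pairs(calibrated, rng, num_pairs=1):
    """Frontier-based pair selection: positives from the upper frontier,
    negatives from the lower frontier, duplicates removed, then random pairing."""
    calibrated = np.asarray(calibrated)
    positives = pareto_frontier(calibrated)
    negatives = pareto_frontier(-calibrated)         # lower frontier
    negatives = np.setdiff1d(negatives, positives)   # remove samples on both frontiers
    pairs = []
    for _ in range(num_pairs):
        if len(positives) == 0 or len(negatives) == 0:
            break
        pairs.append((rng.choice(positives), rng.choice(negatives)))
    return pairs

rng = np.random.default_rng(0)
group = np.array([[0.9, 0.8], [0.6, 0.95], [0.2, 0.1], [0.4, 0.3]])  # (N, K) calibrated rewards
print(select_pairs(group, rng, num_pairs=2))
```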
(d) Train with CaPO Loss
We introduce a calibrated preference optimization (CaPO) loss to align the diffusion model with the calibrated rewards. Given the calibrated rewards of a selected pair, we use a regression loss to fine-tune the diffusion model so that the difference of its implicit rewards (i.e., the log-ratio between the fine-tuned and pretrained models) matches the difference of the calibrated rewards of that pair. Compared to DPO and IPO, which rely on discrete preference labels for training, our objective accounts for the magnitude of the difference in calibrated rewards, which allows us to optimize the model without suffering from overfitting.
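A minimal sketch of such a regression objective is shown below, written from the description above rather than from the actual implementation; the argument names are hypothetical, and for diffusion models the log-probabilities would in practice be approximated via per-timestep denoising terms as in Diffusion-DPO.

```python
import torch

def capo_loss(logp_win, logp_lose, logp_ref_win, logp_ref_lose,
              cal_reward_win, cal_reward_lose, beta=1.0):
    """
    Regression form of the CaPO objective (sketch).

    The implicit reward of a sample is the scaled log-ratio between the
    fine-tuned and pretrained (reference) models. The loss regresses the
    difference of implicit rewards of a pair onto the difference of their
    calibrated rewards, instead of fitting a discrete 0/1 preference label
    as in DPO/IPO.
    """
    implicit_win = beta * (logp_win - logp_ref_win)     # implicit reward of the positive
    implicit_lose = beta * (logp_lose - logp_ref_lose)  # implicit reward of the negative
    target = cal_reward_win - cal_reward_lose           # difference of calibrated rewards
    return ((implicit_win - implicit_lose) - target).pow(2).mean()
```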