What is it about?

Cross-domain image composition aims to seamlessly place one or more user-specified objects into a target visual scene, even when the objects and the scene come from different domains. We introduce a new framework, dubbed TALE, that leverages pretrained text-to-image diffusion models to tackle this task without any additional training.

Why is it important?

Traditional methods often require training additional modules or finetuning diffusion models on specialized datasets, which can be costly and may not fully exploit the strengths of pretrained diffusion models. Some recent approaches avoid these issues by working without training, using attention maps to guide the image generation process indirectly. However, relying solely on attention maps does not always yield the desired compositions: these methods often struggle to preserve the identity of the input objects, or they show limited background-to-object style adaptation in the generated images.

TALE, in contrast, operates directly on the latent space to provide explicit and effective guidance during composition, addressing these challenges. It incorporates two key mechanisms: Adaptive Latent Manipulation and Energy-guided Latent Optimization. The former composes noisy latents from inverted background and foreground latents at a selected timestep that is conducive to initiating and steering the composition process. The latter complements it with specially designed energy functions that further optimize the intermediate latents, refining the style of the final result while keeping it consistent with the input prompts. Our experiments demonstrate that TALE surpasses prior baselines and attains state-of-the-art performance in image composition across various photorealistic and artistic domains.
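For readers who want a concrete picture of the two mechanisms, the short PyTorch sketch below illustrates the general idea. It is not the actual TALE implementation: the latent shapes, the mask-based latent compositing, the timestep threshold, and the placeholder energy function are all simplifying assumptions made purely for illustration.

# Illustrative sketch only (plain PyTorch) -- not the authors' code. Latent
# shapes, the mask-based compositing, and the energy function are assumptions.
import torch

def adaptive_latent_manipulation(z_bg, z_fg, mask, t_start, timesteps):
    # Paste the inverted foreground latent into the inverted background latent
    # inside the user-specified region at a chosen timestep t_start, then keep
    # only the remaining timesteps for denoising from this composed latent.
    z = z_bg.clone()
    z[..., mask] = z_fg[..., mask]
    return z, [t for t in timesteps if t <= t_start]

def energy_guided_update(z, energy_fn, step_size=0.1):
    # Nudge the intermediate latent down the gradient of an energy that scores
    # style/semantic consistency (a placeholder energy is used below).
    z = z.detach().requires_grad_(True)
    grad, = torch.autograd.grad(energy_fn(z), z)
    return (z - step_size * grad).detach()

# Toy usage with random stand-in latents and a dummy energy (mean squared norm).
z_bg = torch.randn(1, 4, 64, 64)            # inverted background latent (assumed shape)
z_fg = torch.randn(1, 4, 64, 64)            # inverted foreground latent
region = torch.zeros(64, 64, dtype=torch.bool)
region[16:48, 16:48] = True                 # target placement region

z_t, remaining_steps = adaptive_latent_manipulation(
    z_bg, z_fg, region, t_start=600, timesteps=list(range(999, -1, -20)))
z_t = energy_guided_update(z_t, energy_fn=lambda z: (z ** 2).mean())
print(z_t.shape, len(remaining_steps))

In the actual method, the composed latent is denoised by a pretrained text-to-image diffusion model, with energy-guided updates applied to the intermediate latents along the way.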

Perspectives

This is my first paper as first author, and I feel extremely proud and accomplished. It has been a truly memorable and rewarding journey exploring this research topic. I hope this paper makes a modest contribution to the field and inspires future related work.

Mr. Kien T. Pham
Hong Kong University of Science and Technology

Read the Original

This page is a summary of: TALE: Training-free Cross-domain Image Composition via Adaptive Latent Manipulation and Energy-guided Optimization, October 2024, ACM (Association for Computing Machinery).
DOI: 10.1145/3664647.3681079.
You can read the full text via the DOI above.

