MuLan: Multimodal-LLM Agent for Progressive Multi-Object Diffusion

AIGC Research Collaboration
HKUST - UCLA - PSU - UMD

We propose MuLan, a training-free controllable text-to-image (T2I) generation framework built on a powerful multimodal LLM. MuLan takes full control over the generation process by generating the objects progressively. During generation, MuLan also performs adaptive self-correction, enabled by the closed-loop feedback from the multimodal LLM. Moreover, MuLan does not require in-context learning to prompt the multimodal LLM.

How MuLan works.


Illustration of examples.

Abstract

Existing text-to-image models still struggle to generate images of multiple objects, especially in handling their spatial positions, relative sizes, overlapping, and attribute bindings. In this paper, we develop a training-free Multimodal-LLM agent (MuLan) to address these challenges by progressive multi-object generation with planning and feedback control, like a human painter. MuLan harnesses a large language model (LLM) to decompose a prompt into a sequence of sub-tasks, each generating only one object by stable diffusion, conditioned on previously generated objects. Unlike existing LLM-grounded methods, MuLan only produces a high-level plan at the beginning, while the exact size and location of each object are determined by an LLM and attention guidance within each sub-task. Moreover, MuLan adopts a vision-language model (VLM) to provide feedback on the image generated in each sub-task and to control the diffusion model to re-generate the image if it violates the original prompt. Hence, each model in every step of MuLan only needs to address an easy sub-task it is specialized for. To evaluate MuLan, we collect 200 prompts containing multiple objects with spatial relationships and attribute bindings from different benchmarks. The results demonstrate the superiority of MuLan over baselines in generating multiple objects.

MuLan

The intuition behind MuLan comes from how a human painter creates an artwork: first make a high-level plan, then paint the objects one after another following that plan, and correct possible mistakes after each painting stage. This is exactly how MuLan works. Below is the complete framework with an example.


The proposed training-free Multimodal-LLM Agent (MuLan) for Progressive Multi-Object Diffusion. MuLan consists of three main components: (1) LLM planning; (2) Single-object diffusion with attention guidance; and (3) VLM-feedback control.
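To make the three components concrete, below is a minimal Python sketch of MuLan's outer loop: plan once with the LLM, then generate one object per stage and re-run a stage when the VLM rejects it. The three callables (plan, generate, check) and the simple retry rule are illustrative placeholders for the components above, not the released implementation.

from typing import Any, Callable, List, Optional

def mulan_loop(
    prompt: str,
    plan: Callable[[str], List[str]],                    # LLM planner: prompt -> per-object sub-prompts
    generate: Callable[[str, Optional[Any], int], Any],  # single-object diffusion stage
    check: Callable[[Any, str], bool],                   # VLM feedback: does the image match the sub-prompt?
    max_retries: int = 3,
) -> Optional[Any]:
    sub_prompts = plan(prompt)        # (1) high-level plan, produced only once at the beginning
    image = None                      # image generated so far (empty before the first stage)
    for stage, sub_prompt in enumerate(sub_prompts):
        candidate = None
        for _ in range(max_retries):
            # (2) generate only this stage's object, conditioned on previously generated objects
            candidate = generate(sub_prompt, image, stage)
            # (3) closed-loop feedback: accept the stage or re-generate it
            if check(candidate, sub_prompt):
                break
        image = candidate             # keep the last candidate even if every retry was rejected
    return image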

Conditional Single-Object Diffusion

At each stage, MuLan focuses on generating only a single object with attention guidance, conditioned on the previously generated objects. To this end, MuLan first utilizes the LLM planner to determine the rough position of the object and the total number of objects sharing that position. MuLan then derives a rough mask for the object from the rough position, the total number, and the precise mask of the previous object, which can be easily computed. The rough mask takes the form of a bounding box indicating the region in which the object should be generated and positioned.
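As one concrete illustration of what such a bounding-box mask could look like, the sketch below splits the half of the canvas indicated by the rough position into equal slots, one per object planned for that position. This splitting rule is an assumption made for illustration; the paper's planner additionally uses the precise mask of the previously generated object.

import numpy as np

def rough_bbox_mask(
    rough_position: str,   # rough position from the LLM planner, e.g. "left", "right", "top", "bottom"
    index: int,            # index of the current object among those sharing this rough position
    total: int,            # total number of objects planned for this rough position
    height: int = 512,
    width: int = 512,
) -> np.ndarray:
    # Binary mask (H, W); the 1-region is the bounding box where the object should appear.
    mask = np.zeros((height, width), dtype=np.float32)
    if rough_position in ("left", "right"):
        # Take the left or right half of the canvas and split it vertically into `total` slots.
        x0, x1 = (0, width // 2) if rough_position == "left" else (width // 2, width)
        slot = height // total
        y0, y1 = index * slot, (index + 1) * slot
    else:
        # Take the top or bottom half of the canvas and split it horizontally into `total` slots.
        y0, y1 = (0, height // 2) if rough_position == "top" else (height // 2, height)
        slot = width // total
        x0, x1 = index * slot, (index + 1) * slot
    mask[y0:y1, x0:x1] = 1.0
    return mask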

With the rough mask, MuLan adopts backward guidance to manipulate the attention maps during the denoising steps of the diffusion model, ensuring that the object is generated and positioned correctly.
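Below is a minimal sketch of one backward-guidance update in the spirit of training-free layout control with cross-attention guidance: an energy term rewards the object's token for placing its attention mass inside the rough mask, and its gradient nudges the noisy latent before the next denoising step. The get_attn_map hook, the energy form, and the step size are illustrative assumptions, not the paper's exact configuration.

import torch
from typing import Callable

def backward_guidance_step(
    latent: torch.Tensor,                                  # noisy latent z_t at the current denoising step
    get_attn_map: Callable[[torch.Tensor], torch.Tensor],  # hypothetical hook: runs the UNet and returns the cross-attention map (H', W') of the object's token
    mask: torch.Tensor,                                    # rough bounding-box mask resized to (H', W')
    step_size: float = 10.0,
) -> torch.Tensor:
    latent = latent.detach().requires_grad_(True)
    attn = get_attn_map(latent)                            # forward pass with attention hooks, differentiable w.r.t. the latent
    # Energy is small when most of the token's attention mass falls inside the box.
    energy = (1.0 - (attn * mask).sum() / (attn.sum() + 1e-8)) ** 2
    grad = torch.autograd.grad(energy, latent)[0]
    return (latent - step_size * grad).detach()            # guided latent fed to the next denoising step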

After each generation stage, a VLM evaluates whether the generated image aligns with the input prompt. If the image violates the prompt, MuLan adaptively adjusts the diffusion model and re-generates the object.
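A minimal sketch of what such a per-stage check could look like is given below: the VLM is asked yes/no questions about the current object and its position, and the stage is accepted only if every answer is positive. The questions and the vlm_answer interface are assumptions for illustration; the paper's checker and its correction strategy may differ.

from typing import Any, Callable

def check_stage(
    image: Any,                               # image produced by the current generation stage
    sub_prompt: str,                          # this stage's sub-prompt, e.g. "a red apple on the left"
    vlm_answer: Callable[[Any, str], str],    # hypothetical VLM interface: (image, question) -> answer text
) -> bool:
    # Query the VLM with yes/no questions covering presence, attributes, and position.
    questions = [
        f"Does the image contain {sub_prompt}?",
        f"Is the position of {sub_prompt} consistent with the description?",
    ]
    return all(vlm_answer(image, q).strip().lower().startswith("yes") for q in questions)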

An illustration of the single-object diffusion is shown below. For the detailed procedure, please refer to Algorithm 1 in the paper.


Single-object diffusion with LLM planning and attention guidance.

Results

To evaluate the performance of MuLan, we curate a prompt dataset in which each prompt contains multiple objects with both attribute bindings and spatial relationships. Specifically, the dataset consists of all complex spatial prompts from T2I-CompBench and complex prompts generated by ChatGPT. We compare MuLan with controllable generation methods and general state-of-the-art T2I diffusion models.

Since the performance of different methods is measured by the alignment between the input prompt and the generated image, we adopt both GPT-4V evaluation and human evaluation to comprehensively assess this alignment. Each prompt-image pair is evaluated from three aspects: object completeness, correctness of attribute bindings, and correctness of spatial relationships.
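For reference, the sketch below shows one simple way the three per-pair judgments could be aggregated into dataset-level scores; the aspect names and the 0/1 judgments are illustrative, not the exact scoring protocol.

from statistics import mean
from typing import Dict, List

ASPECTS = ("object_completeness", "attribute_binding", "spatial_relationship")

def aggregate_scores(per_pair: List[Dict[str, float]]) -> Dict[str, float]:
    # Average each aspect's judgment (e.g. 0/1 from GPT-4V or a human rater) over all prompt-image pairs.
    return {aspect: mean(pair[aspect] for pair in per_pair) for aspect in ASPECTS}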


GPT-4V / human evaluation of images generated by different methods.

Ablation on the VLM feedback control

To investigate the importance and effect of the VLM feedback control, we conduct an extensive ablation study. First, we evaluate the performance of MuLan with the feedback control removed. We then test the compatibility of MuLan with different VLMs. The results are shown below.


The left table shows that the VLM feedback is a key component of MuLan; the right table shows that MuLan is compatible with different VLMs and maintains good performance across them.

More Visualization Results

More visualization results of different methods.


BibTeX

@misc{li2024mulan,
        title={MuLan: Multimodal-LLM Agent for Progressive Multi-Object Diffusion},
        author={Sen Li and Ruochen Wang and Cho-Jui Hsieh and Minhao Cheng and Tianyi Zhou},
        year={2024},
        eprint={???},
        archivePrefix={arXiv},
        primaryClass={cs.CV}
      }