Multimodal Data Augmentation for Image Captioning using Diffusion Models

Changrong Xiao; Sean Xin Xu; Kunpeng Zhang

doi:10.1145/3607827.3616839

What is it about?

Image captioning, an important vision-language task, typically requires a large number of finely labeled image-caption pairs to learn the alignment between images and text. In this paper, we propose a multimodal data augmentation method that uses the text-to-image model Stable Diffusion to expand the training set with high-quality generated image-caption pairs. Extensive experiments on the MS COCO dataset show that our approach outperforms several benchmark methods, with a particularly significant boost when fewer training instances are available. Models trained on our augmented datasets also outperform prior unpaired image captioning methods by a large margin. Finally, filtering the generated data based on quality assessment further improves training efficiency and effectiveness.
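The augment-then-filter idea in the summary above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that new images are synthesized from existing captions and that a scalar quality score decides which generated pairs to keep. The names `Pair`, `augment`, `generate_image`, `score_pair`, and `threshold` are hypothetical, and the two `fake_*` functions are stand-ins for a real text-to-image model (such as Stable Diffusion) and a real quality metric.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Pair:
    """One image-caption training example (hypothetical container)."""
    image: Any
    caption: str
    synthetic: bool = False  # True for generated pairs
    score: float = 1.0       # quality score; real pairs default to 1.0


def augment(
    pairs: List[Pair],
    generate_image: Callable[[str], Any],
    score_pair: Callable[[Any, str], float],
    threshold: float,
) -> List[Pair]:
    """Return the original pairs plus generated pairs that pass the filter.

    Assumption: one image is generated per existing caption, and a pair
    is kept only if its quality score is at least `threshold`.
    """
    augmented = list(pairs)
    for pair in pairs:
        image = generate_image(pair.caption)
        score = score_pair(image, pair.caption)
        if score >= threshold:
            augmented.append(Pair(image, pair.caption, True, score))
    return augmented


# Stand-ins so the sketch runs without any model; both are assumptions.
def fake_generate(caption: str) -> str:
    return f"<image for: {caption}>"


def fake_score(image: str, caption: str) -> float:
    # Toy score: captions with more words count as higher quality.
    return min(1.0, len(caption.split()) / 5)


train = [
    Pair("img0.jpg", "a dog runs on the beach"),
    Pair("img1.jpg", "a cat"),
]
augmented = augment(train, fake_generate, fake_score, threshold=0.5)
```

Passing the generator and scorer in as functions keeps the sketch independent of any particular model or metric; in practice `generate_image` would wrap a text-to-image pipeline and `score_pair` a quality assessment of the generated pair.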

This page is a summary of: Multimodal Data Augmentation for Image Captioning using Diffusion Models, October 2023, ACM (Association for Computing Machinery), DOI: 10.1145/3607827.3616839.

Contributors

The following have contributed to this page

Changrong Xiao
Tsinghua University
