What is it about?
We introduce MSC, a large-scale underwater video collection designed to improve video understanding of marine environments. The dataset, featuring 396 video-segmentation mask-text triplets, was carefully annotated using a novel two-stage pipeline. It establishes a new benchmark for video understanding tasks such as video grounding, captioning, and text-to-video generation, directly addressing the inaccuracies and 'hallucinations' of current AI models in marine video analysis.
Why is it important?
Our MSC dataset helps computer models understand what is happening in video segments, giving them a deeper grasp of ocean scenes. This matters because even advanced AI models struggle with marine videos: they sometimes make things up or get details wrong when describing what they see. Our new video collection helps solve this problem, making these models more accurate and less likely to "hallucinate", or invent false information.
Perspectives
This research is motivated by two key questions, both addressed in the paper: why a new video captioning dataset is needed, and why we focus on clip-level captioning. The work promises significant progress in marine understanding, thanks to the close collaboration between AI researchers and marine biologists in building this dataset.
Quang Trung Truong
Hong Kong University of Science and Technology
Read the Original
This page is a summary of: MSC: A Marine Wildlife Dataset for Video Understanding with Grounded Segmentation and Clip-Level Captions, October 2025, ACM (Association for Computing Machinery),
DOI: 10.1145/3746027.3758198.