What is it about?

Our work aims to make AI systems better at thinking and correcting themselves. We focus on vision–language models — AI that can see an image and explain or reason about it in natural language. While these models are powerful, they often make small mistakes that lead to wrong answers, such as misreading a number or making a simple calculation error. To fix this, we introduce a feedback-based framework where one model acts as the "actor" to solve a problem, and another model, the "critic," reviews its reasoning and gives natural-language feedback. The actor then improves its answer step by step based on these critiques. This process helps the AI learn to reason more reliably, similar to how humans reflect on their own thinking.
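The actor-critic refinement loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `actor`, `critic`, and the toy arithmetic answers are hypothetical stand-ins for the two vision-language models, and the real system's MCTS-based critique is abstracted away behind the `critic` call.

```python
# Hypothetical sketch of the actor-critic refinement loop.
# `actor` and `critic` stand in for two vision-language models;
# simple stubs are used here so the loop itself is runnable.

def actor(question, image, feedback=None):
    """Stand-in actor: returns an initial or revised answer."""
    if feedback is None:
        return "7 x 8 = 54"  # initial attempt with a small arithmetic slip
    return "7 x 8 = 56"      # revised answer after the critique

def critic(question, image, answer):
    """Stand-in critic: returns natural-language feedback, or None if satisfied."""
    if "54" in answer:
        return "Check the multiplication: 7 x 8 is not 54."
    return None

def refine(question, image, max_rounds=3):
    """Iteratively improve the actor's answer using the critic's feedback."""
    answer = actor(question, image)
    for _ in range(max_rounds):
        feedback = critic(question, image, answer)
        if feedback is None:  # critic accepts the reasoning; stop refining
            break
        answer = actor(question, image, feedback)
    return answer

print(refine("What is 7 x 8?", image=None))  # -> 7 x 8 = 56
```

The loop terminates either when the critic has no further objections or after a fixed number of rounds, mirroring the step-by-step self-correction described in the summary.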

Read the Original

This page is a summary of: MMC: Iterative Refinement of VLM Reasoning via MCTS-based Multimodal Critique, October 2025, ACM (Association for Computing Machinery). DOI: 10.1145/3728422.3762145.
