What is it about?
This is a systematic literature review of metamorphic testing (MT) for deep code models, often referred to as Large Language Models for Code (LLM4Code). These AI models are transforming software engineering by performing tasks such as code completion, defect detection, and code summarization. The core of the review is to understand how MT is used to evaluate the robustness of these models. Robustness is a critical quality attribute because the models can yield different or incorrect results even after minor, semantically insignificant changes to the code, such as renaming a variable. Traditional testing methods often face the "oracle problem": it is difficult to determine the correct output for every input. MT addresses this by applying semantics-preserving transformations to code (e.g., converting a for loop to a while loop, which does not change functionality) and then checking whether the model's predictions remain consistent. The review analyzes 45 papers, detailing the types of transformations, the techniques used to apply them, the most frequently tested models, programming tasks, datasets, and evaluation metrics. It also highlights current trends and challenges and outlines future research directions in the field.
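To make the approach concrete, here is a minimal sketch, not taken from the paper, of what such a metamorphic test can look like in Python. The transformation renames an identifier (a semantics-preserving change), and the metamorphic relation requires the model's prediction to stay the same on the original and transformed code. The `query_model` function is a hypothetical placeholder for the model under test; here it is stubbed with a toy summarizer so the example runs end to end.

```python
import re


def rename_variable(code: str, old: str, new: str) -> str:
    """Semantics-preserving transformation: rename an identifier.

    Word-boundary matching avoids touching longer identifiers that merely
    contain `old`; a production tool would rewrite the AST instead.
    """
    return re.sub(rf"\b{re.escape(old)}\b", new, code)


def query_model(code: str) -> str:
    """Hypothetical stand-in for the deep code model under test.

    Stubbed here as a toy 'summarizer' that reports the first defined
    function name, so the example is runnable without any model.
    """
    match = re.search(r"def\s+(\w+)", code)
    return f"function `{match.group(1)}`" if match else "no function found"


def metamorphic_test(code: str, old: str, new: str) -> bool:
    """Metamorphic relation: the prediction should not change under a
    semantics-preserving variable rename."""
    original_prediction = query_model(code)
    transformed_prediction = query_model(rename_variable(code, old, new))
    return original_prediction == transformed_prediction


if __name__ == "__main__":
    snippet = "def add(a, b):\n    result = a + b\n    return result\n"
    consistent = metamorphic_test(snippet, "result", "total")
    print("consistent prediction" if consistent else "robustness violation detected")
```

In practice, the stub would be replaced by a call to the actual deep code model being evaluated, and for generative tasks the strict equality check is often relaxed to a similarity threshold.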
Why is it important?
The importance of this topic stems from the revolutionary impact of deep code models, which are becoming integral to modern software engineering thanks to their high accuracy on code-related tasks. Their practical applicability and trustworthiness, however, depend critically on their robustness. If the models are not robust, they might fail to detect a security vulnerability or repair a bug simply because a developer used different variable names, leading to significant reliability and security risks. For example, studies have shown that about 40% of the code generated by GitHub Copilot can be affected by security vulnerabilities. Metamorphic testing is crucial because it mitigates the "oracle problem" in software testing, a long-standing challenge, by checking expected relationships between inputs and outputs rather than absolute correctness. This systematic literature review provides a much-needed, unified overview of metamorphic testing specifically for LLM4Code, addressing the previous fragmentation in research and terminology. Its findings can guide future research and development, helping ensure that deep code models are comprehensively evaluated for robustness and for other critical quality attributes such as security, privacy, and explainability, ultimately fostering more reliable and trustworthy AI in software development.
Perspectives
We believe that the results of our study can play a significant role in guiding future research and development in this rapidly evolving area. We hope our work serves as a foundation for future research and contributes toward a more rigorous, generalizable, and practically applicable evaluation of code-focused large language models.
Ali Asgari
Technische Universiteit Delft
Read the Original
This page is a summary of: Metamorphic Testing of Deep Code Models: A Systematic Literature Review, ACM Transactions on Software Engineering and Methodology, September 2025, ACM (Association for Computing Machinery). DOI: 10.1145/3766552.