What is it about?
Many artificial intelligence (AI) systems are now used to study human cells and support medical research. However, these models are often trained on datasets that do not represent different ethnic groups, ages, and sexes equally. When the training data are imbalanced, a model can inadvertently learn patterns that reflect who is most represented rather than true biology, leading to errors that disproportionately affect groups already underrepresented in healthcare.

In this project, we measure how much demographic information these models pick up using a metric called iLISI (integration Local Inverse Simpson's Index). iLISI checks whether a model keeps cells from different demographic groups mixed together in its internal representation, or separates them along demographic lines in ways that have no biological justification; a simplified sketch of the computation appears below. In some analyses, we found that cells from Hispanic donors clustered apart from cells from non-Hispanic donors even though there was no biological reason for them to do so.

To reduce these biases, we used a method called scDesign3 to generate synthetic cells that help balance the dataset. The synthetic cells are statistically similar to real data but improve the representation of groups with fewer samples. Adding these synthetic cells made the model's representations more mixed and reduced demographic-specific clustering. Overall, this research shows how AI models can unintentionally encode demographic bias, and how carefully designed synthetic data can help correct it.
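To make the measurement concrete, here is a minimal sketch of how an iLISI-style score can be computed from a model's cell embeddings. This is a simplified k-nearest-neighbor version of the metric, not the exact kernel-weighted implementation used in standard benchmarking packages; the function name and the k = 30 default are our own illustrative choices.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def simplified_ilisi(embeddings, labels, k=30):
    """Mean inverse Simpson's index over each cell's k nearest neighbors.

    embeddings : (n_cells, n_dims) array of model representations
    labels     : (n_cells,) demographic group label per cell
    """
    labels = np.asarray(labels)
    groups = np.unique(labels)

    # Find each cell's k nearest neighbors in embedding space
    # (k + 1 because each cell is its own nearest neighbor).
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
    _, idx = nn.kneighbors(embeddings)
    idx = idx[:, 1:]  # drop the cell itself

    scores = np.empty(len(labels))
    for i, neigh in enumerate(idx):
        # Proportion of each demographic group in this neighborhood.
        p = np.array([(labels[neigh] == g).mean() for g in groups])
        scores[i] = 1.0 / np.sum(p ** 2)  # inverse Simpson's index
    return scores.mean()
```

An average score near 1 means each cell's neighborhood is dominated by a single demographic group (the separation we want to detect), while a score approaching the number of groups indicates that the groups are well mixed.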
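scDesign3 itself is an R/Bioconductor package that fits per-gene statistical models (with copulas capturing gene-gene correlations) to generate realistic synthetic cells. As a rough Python illustration of the balancing idea only, the sketch below fits an independent negative binomial to each gene for the underrepresented group and samples new cells from it; the function name, the method-of-moments fit, and the Poisson fallback are our assumptions, not the authors' pipeline.

```python
import numpy as np

def augment_group(counts, labels, target_group, n_new, seed=0):
    """Hypothetical stand-in for scDesign3-style augmentation.

    Fits an independent negative binomial per gene to the cells of the
    underrepresented group (method of moments) and samples `n_new`
    synthetic cells. scDesign3 additionally models gene-gene
    correlations; this sketch deliberately ignores them.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    X = counts[labels == target_group]   # (n_cells_in_group, n_genes)
    mu = X.mean(axis=0)
    var = X.var(axis=0)

    # Method-of-moments NB parameters; valid only where var > mean.
    overdispersed = var > mu
    n_param = np.where(overdispersed, mu**2 / np.maximum(var - mu, 1e-8), 1.0)
    p_param = np.where(overdispersed, n_param / (n_param + mu), 1.0)

    synthetic = np.empty((n_new, counts.shape[1]), dtype=np.int64)
    for j in range(counts.shape[1]):
        if overdispersed[j]:
            synthetic[:, j] = rng.negative_binomial(n_param[j], p_param[j], n_new)
        else:
            # No overdispersion detected: fall back to a Poisson model.
            synthetic[:, j] = rng.poisson(mu[j], n_new)
    return synthetic
```

The sampled matrix would then be appended to the real counts, with the matching demographic label, so that each group contributes a comparable number of cells before the model is retrained or re-evaluated.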
Why is it important?
As single-cell technologies become more common in research and clinical settings, biased AI models pose real risks. If a model works well only for the groups it has seen most often, it may fail for patients from underrepresented backgrounds, widening existing health disparities. By showing how demographic imbalance shapes the internal behavior of widely used models such as scGPT and Geneformer, this work provides a framework for identifying and reducing these biases before deployment. Our synthetic data approach offers a practical way to improve fairness without requiring large new datasets, which is especially valuable in low-resource environments. The findings highlight the need for demographic-aware data curation standards and demonstrate that better representation directly improves AI reliability. This work contributes to building medical AI systems that serve all populations more equitably.
Perspectives
From my perspective, this project showed how easily AI systems can learn demographic signals that researchers never intended to encode. Even subtle imbalances in the data produced noticeable differences in how models treated cells from different groups. Seeing this firsthand reinforced the importance of designing biological AI models that reflect the diversity of real patients. Working on synthetic data augmentation also demonstrated that fairness interventions are possible and can be evaluated rigorously. My hope is that this work encourages the community to treat demographic representation as a core part of model development, not an afterthought, so that future diagnostic tools can serve everyone more reliably.
Fernando Peralta Castro
Brown University
Read the Original
This page is a summary of: Extending Fairness in Single-Cell AI: Evaluating Age, Sex, and Ethnicity Bias with Synthetic Data Augmentation, October 2025, ACM (Association for Computing Machinery). DOI: 10.1145/3765612.3767761.