What is it about?
This paper introduces TinyServe, a system that performs query-aware cache selection during language-model inference: instead of attending over the entire key-value (KV) cache, it identifies and reuses only the cached entries most relevant to the current query. This speeds up large-language-model serving while reducing memory and energy use.
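The core idea of query-aware cache selection can be illustrated with a small sketch: score each block ("page") of the cached keys by its similarity to the current query, and attend only over the top-scoring pages. The function name, paging scheme, and scoring rule below are illustrative assumptions for exposition, not the paper's exact method.

```python
import numpy as np

def select_kv_pages(query, keys, page_size=4, top_k=2):
    """Illustrative sketch of query-aware KV cache selection.

    Splits the cached keys into fixed-size pages, scores each page by the
    maximum query-key dot product inside it, and returns the indices of the
    top_k pages. Attention would then run only over the selected pages.
    """
    n_pages = len(keys) // page_size
    # Shape: (n_pages, page_size, head_dim)
    pages = keys[: n_pages * page_size].reshape(n_pages, page_size, -1)
    # Per-page relevance: best query-key similarity within the page.
    scores = np.einsum("d,pkd->pk", query, pages).max(axis=1)
    # Indices of the top_k most relevant pages, highest score first.
    return np.argsort(scores)[::-1][:top_k]

# Hypothetical usage: two cached keys (in pages 1 and 3) closely match the query.
rng = np.random.default_rng(0)
d = 8
query = np.ones(d)
keys = rng.normal(size=(16, d)) * 0.01  # mostly irrelevant cached keys
keys[5] = np.ones(d)                    # highly relevant key in page 1
keys[13] = np.ones(d) * 0.9             # next most relevant key in page 3
chosen = select_kv_pages(query, keys, page_size=4, top_k=2)
```

Under this sketch, `chosen` contains pages 1 and 3, so downstream attention touches 8 cached entries instead of 16, which is the source of the memory and latency savings the summary describes.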
Why is it important?
Serving large language models is expensive and slow. Our method cuts cost and latency without losing accuracy, helping researchers and companies deploy AI more sustainably.
Perspectives
AI system engineers, data-center architects, and researchers developing efficient foundation-model infrastructure.
Yanxuan Yu
Columbia University
Read the Original
This page is a summary of: TinyServe: Query-Aware Cache Selection for Efficient LLM Serving, October 2025, ACM (Association for Computing Machinery), DOI: 10.1145/3746027.3758181.