What is it about?

AI model inference demands significant computational power, making efficient resource use on embedded devices critical. Our solution is a thread-level stream scheduling method. Leveraging the unified memory architecture of the NVIDIA Jetson Xavier NX, it binds host threads to CUDA streams so that independent work can be scheduled in parallel, raising GPU utilization. This approach significantly boosts both the throughput and speed of model inference at the edge.
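
To make the mechanism concrete, here is a minimal sketch of the underlying CUDA pattern, not the paper's actual scheduler: each host thread owns its own CUDA stream and launches work into it, so independent requests can overlap on the GPU, while managed (unified) memory mirrors the shared CPU/GPU memory of the Jetson platform. The kernel, buffer sizes, and thread count are illustrative placeholders.

```cuda
// Minimal sketch (illustrative, not the paper's implementation):
// one CUDA stream per host thread, over managed (unified) memory.
#include <cuda_runtime.h>
#include <cstdio>
#include <thread>
#include <vector>

__global__ void scaleKernel(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;  // stand-in for one inference kernel
}

void worker(int id, float* data, int n) {
    // One stream per host thread: launches from different threads land in
    // different streams and may run concurrently on the GPU.
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scaleKernel<<<blocks, threads, 0, stream>>>(data, 1.0f + id, n);

    cudaStreamSynchronize(stream);  // wait only for this thread's work
    cudaStreamDestroy(stream);
}

int main() {
    const int kThreads = 4;        // number of host threads (illustrative)
    const int kElems   = 1 << 20;  // elements per thread's buffer

    // Managed memory: a single allocation visible to both CPU and GPU,
    // matching the unified memory model of the Jetson platform.
    std::vector<float*> buffers(kThreads);
    for (int t = 0; t < kThreads; ++t) {
        cudaMallocManaged(&buffers[t], kElems * sizeof(float));
        for (int i = 0; i < kElems; ++i) buffers[t][i] = 1.0f;
    }

    std::vector<std::thread> pool;
    for (int t = 0; t < kThreads; ++t)
        pool.emplace_back(worker, t, buffers[t], kElems);
    for (auto& th : pool) th.join();

    printf("buffer 0, element 0 = %f\n", buffers[0][0]);  // expect 1.0
    for (float* b : buffers) cudaFree(b);
    return 0;
}
```

Launches issued into different streams may execute concurrently when GPU resources allow; that overlap is the kind of parallelism the scheduling method exploits to raise utilization.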

Why is it important?

Most existing compilation frameworks for deploying AI models on edge devices focus on general-purpose model optimizations and often overlook the specific architectural traits of embedded boards. Our work identifies a key issue: model compression can leave on-chip resources underutilized. To address this, we introduce a thread-level CUDA stream scheduling method that significantly boosts GPU utilization, thereby increasing model throughput and inference speed. This research contributes to both edge AI deployment and compiler design, demonstrating a path to lower the cost and energy consumption of AI services through more efficient hardware use.

Perspectives

Edge AI is poised to enhance intelligent capabilities across all industries. Future model deployment frameworks will be able to optimize models efficiently for specific hardware through techniques such as model architecture optimization, kernel fusion, memory management, and computation-stream scheduling. With the support of such frameworks, AI models can be deployed rapidly, boosting efficiency and improving quality of life.

Yijie Chen
Northwest A&F University

Read the Original

This page is a summary of: A Thread-level Stream Scheduling Method for Accelerating LVMs' Inference on a Resource-constrained Platform, ACM Transactions on Embedded Computing Systems, October 2025, ACM (Association for Computing Machinery),
DOI: 10.1145/3771550.
You can read the full text via the DOI above.
