What is it about?

Data preprocessing, which consists of tasks like sample resizing, cropping, and filtering, is a crucial step in machine learning (ML) workflows. Even though preprocessing is largely ignored by work that focuses on optimizing training algorithms, in practice preprocessing and training are pipelined for many workloads. Popular ML frameworks like PyTorch use data loaders to feed data into model training. If the pipelining between preprocessing and training is not done carefully, it can cause significant waiting times on the GPU side. To address this limitation, we introduce SpeedyLoader, a system that overlaps preprocessing and training by leveraging asynchronous data preprocessing and avoiding head-of-line blocking. SpeedyLoader incorporates dedicated data loading threads, which organize preprocessed samples into queues based on their predicted processing times. Concurrently, GPUs fetch samples from these queues, ensuring that training never stalls waiting for preprocessing to complete.

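To make the queueing idea concrete, here is a minimal sketch in Python of how samples might be routed into per-cost queues so that one slow sample never blocks fast ones. This is our illustration rather than SpeedyLoader's actual code: the helpers predict_cost and preprocess, the two-queue split, and the COST_THRESHOLD cutoff are all assumptions made for the example.

```python
import queue
import threading

# Hypothetical helpers standing in for the real pipeline: predict_cost()
# estimates a sample's preprocessing time and preprocess() performs the
# actual transforms (resizing, cropping, filtering, ...).
def predict_cost(sample):
    return len(sample)  # toy heuristic: bigger sample -> slower

def preprocess(sample):
    return sample  # placeholder transform

FAST = queue.Queue(maxsize=64)
SLOW = queue.Queue(maxsize=64)
COST_THRESHOLD = 100  # assumed cutoff between "fast" and "slow" samples

def loader_worker(samples):
    """Dedicated loading thread: preprocess asynchronously and route each
    sample to a queue chosen by its predicted processing time."""
    for s in samples:
        out_q = FAST if predict_cost(s) < COST_THRESHOLD else SLOW
        out_q.put(preprocess(s))

def next_batch(batch_size):
    """Consumer (training) side: prefer whichever queue already has ready
    samples, so one slow sample cannot hold up the whole batch
    (no head-of-line blocking)."""
    batch = []
    while len(batch) < batch_size:
        try:
            batch.append(FAST.get_nowait())
        except queue.Empty:
            batch.append(SLOW.get())  # fall back to a slow sample
    return batch

if __name__ == "__main__":
    # Toy dataset where sample size stands in for preprocessing cost.
    data = [b"x" * n for n in (10, 500, 20, 300, 5, 40, 250, 15)]
    for chunk in (data[:4], data[4:]):
        threading.Thread(target=loader_worker, args=(chunk,), daemon=True).start()
    print(len(next_batch(4)), "samples ready for the GPU")
```

In the real system the consumer side would also balance fast and slow samples to preserve batch composition; the sketch simply prefers whichever queue has items ready.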

Why is it important?

Compared to the default PyTorch DataLoader, SpeedyLoader reduces training time by up to 30% and increases GPU usage by 4.3×, all while maintaining a consistent evaluation accuracy of 91%.

Perspectives

SpeedyLoader is a first step towards efficient pipelining of preprocessing and training. Looking ahead, our goals include achieving 100% GPU usage by further optimizing the data loader implementation (e.g., integrating a custom queue with finer-grained locking, as sketched below). We also plan to extend our study of workloads and heuristics for predicting data preprocessing times beyond computer vision (e.g., to large language models and recommendation systems). Finally, we plan to extend SpeedyLoader to support co-locating different workloads.
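
The "custom queue with finer-grained locking" mentioned above is future work, so the paper does not specify a design. One classic way to reduce lock contention, shown here purely as a hypothetical illustration, is a two-lock queue in the style of Michael and Scott, where producers and consumers synchronize on separate locks:

```python
import threading

class TwoLockQueue:
    """Linked-list queue with separate head and tail locks, so producer
    threads (put) and consumer threads (get) never contend on a single
    global lock. A sketch of finer-grained locking, not SpeedyLoader's
    actual implementation."""

    class _Node:
        __slots__ = ("value", "next")
        def __init__(self, value=None):
            self.value = value
            self.next = None

    def __init__(self):
        dummy = self._Node()          # sentinel node
        self.head = dummy             # consumers advance head
        self.tail = dummy             # producers advance tail
        self.head_lock = threading.Lock()
        self.tail_lock = threading.Lock()

    def put(self, value):
        node = self._Node(value)
        with self.tail_lock:          # producers only take the tail lock
            self.tail.next = node
            self.tail = node

    def get(self):
        with self.head_lock:          # consumers only take the head lock
            nxt = self.head.next
            if nxt is None:
                return None           # empty; caller can retry or back off
            self.head = nxt
            return nxt.value
```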

Rahma Nouaji
McGill University

Read the Original

This page is a summary of: SpeedyLoader, April 2024, ACM (Association for Computing Machinery). DOI: 10.1145/3642970.3655824.
