Some of the content on this page has been created using generative AI.
What is it about?
This research introduces Driple, a graph neural network (GNN)-based approach to predicting the resource consumption of diverse workloads in distributed deep learning systems. The study addresses the challenge of accurately estimating the resources needed to train deep learning models across varied execution settings, such as GPU type and number of GPUs. Unlike previous approaches, Driple accommodates a wide range of combinations of settings and workloads. By combining GNNs with transfer learning, Driple efficiently predicts resource consumption, including GPU utilization, memory usage, and network throughput, helping users manage training time and its associated costs.
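To make the idea concrete, here is a minimal sketch of graph-based resource prediction: a training workload is represented as an operator graph, a few rounds of message passing mix neighbor features, and a pooled graph embedding feeds a linear readout that emits three numbers standing in for GPU utilization, memory usage, and network throughput. All names, weights, and dimensions below are illustrative assumptions, not Driple's actual architecture.

```python
import numpy as np

def gnn_predict(node_feats, adj, w_msg, w_out, steps=2):
    """Illustrative message-passing sketch (not Driple's real model).

    node_feats: (N, F) per-operator features of the workload graph
    adj:        (N, N) adjacency matrix of the operator graph
    w_msg:      (F, F) message-mixing weights (random here, untrained)
    w_out:      (F, 3) readout producing three stand-in metrics
    """
    h = node_feats
    # degree normalization so each node averages its neighbors
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)
    for _ in range(steps):
        h = np.tanh((adj @ h) / deg @ w_msg)
    graph_emb = h.mean(axis=0)   # mean-pool node embeddings
    return graph_emb @ w_out     # 3 predicted resource metrics

rng = np.random.default_rng(0)
# toy 3-node operator chain: conv -> relu -> fc
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
feats = rng.normal(size=(3, 4))
pred = gnn_predict(feats, adj,
                   rng.normal(size=(4, 4)),
                   rng.normal(size=(4, 3)))
print(pred.shape)  # one value per stand-in metric
```

With random, untrained weights the outputs are meaningless numbers; the point is only the data flow, where a single graph-structured description of the workload yields a fixed-size vector of resource predictions regardless of graph size.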
Why is it important?
In the rapidly evolving field of deep learning, where models are becoming larger and more complex, predicting resource consumption is vital. Driple addresses the uncertainty users face when configuring execution settings for distributed training, such as the type and number of GPUs. Existing challenges stem from the diversity of training settings (devices, parameter servers, etc.) and the variability of workloads (models, datasets, hyperparameters). Driple's contribution lies in accurately predicting resource consumption across a broad spectrum of scenarios, helping users optimize training and reducing the time and effort needed to tailor predictions to different settings.
Key Takeaway
Driple, built on graph neural networks, effectively predicts resource consumption for a variety of distributed deep learning workloads and settings. It uses transfer learning to adapt predictions to new settings, significantly reducing the time needed to produce tailored predictions. By predicting GPU utilization, memory usage, and network throughput, Driple offers a comprehensive way for users to estimate resource needs, ultimately improving the efficiency and cost-effectiveness of distributed deep learning training.
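The transfer-learning idea mentioned above can be sketched very simply: keep a pretrained prediction model's body fixed and refit only a small output layer on a handful of measurements taken in the new setting. The least-squares refit below is a hypothetical stand-in for whatever fine-tuning procedure Driple actually uses; the names and shapes are assumptions for illustration.

```python
import numpy as np

def adapt_readout(embeddings, measured):
    """Transfer-learning sketch (hypothetical, not Driple's method):
    the pretrained GNN body is frozen, so each workload reduces to a
    fixed embedding; only the linear readout is refit, by least
    squares, on a few measurements from the new setting.

    embeddings: (S, F) frozen graph embeddings of S sample workloads
    measured:   (S, 3) resource metrics observed in the new setting
    """
    w_new, *_ = np.linalg.lstsq(embeddings, measured, rcond=None)
    return w_new  # (F, 3) readout adapted to the new setting

rng = np.random.default_rng(1)
emb = rng.normal(size=(6, 4))          # 6 profiled sample workloads
true_w = rng.normal(size=(4, 3))       # pretend ground-truth mapping
measured = emb @ true_w                # their measured metrics
w_new = adapt_readout(emb, measured)
print(np.allclose(emb @ w_new, measured))
```

Because only a small readout is refit rather than the whole network retrained, adaptation needs far fewer measurements from the new setting, which is the source of the time savings the summary describes.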
Read the Original
This page is a summary of: Prediction of the Resource Consumption of Distributed Deep Learning Systems, Proceedings of the ACM on Measurement and Analysis of Computing Systems, May 2022, ACM (Association for Computing Machinery),
DOI: 10.1145/3530895.