What is it about?

In many scientific applications, matrix-matrix multiplication (known as GEMM) and matrix-vector multiplication (known as GEMV) are two important operations. AI also relies heavily on these two kernels, and their growing popularity makes executing them quickly ever more important. Modern supercomputers increasingly combine CPUs and GPUs in the same system. It is typically assumed that GEMM executes quicker on a GPU and GEMV executes quicker on a CPU. This work tests whether that is actually the case on three modern supercomputers, each using different CPUs and GPUs, across a large range of matrix and vector sizes and shapes (i.e. square and rectangular matrices). Using a newly developed benchmark, it shows that the hardware a supercomputer has, how the CPU and GPU are connected to each other, which GEMM and GEMV library is used, how many back-to-back GEMM or GEMV operations are performed, and the shape of the matrices all affect the problem size at which offloading the operations to the GPU becomes quicker. Therefore, depending on the CPU and GPU, there is no "one size fits all" answer to whether the CPU or GPU should be used.
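To make the idea of an offload threshold concrete, the sketch below times a square GEMM on the CPU and on the GPU (including the cost of copying the matrices to the GPU) and reports which is faster at each size. It is a minimal illustration, not the benchmark developed in the paper: it assumes NumPy (backed by a CPU BLAS) and CuPy (backed by cuBLAS) are available, and the matrix sizes and repeat counts are arbitrary.

```python
# Minimal sketch (not the paper's benchmark): compare double-precision GEMM
# on the CPU (NumPy, via its BLAS backend) against the GPU (CuPy, via cuBLAS),
# including host-to-device transfer time, to see where the GPU overtakes the
# CPU. Sizes and repeat counts are illustrative only.
import time
import numpy as np
import cupy as cp

def cpu_gemm_time(n, repeats=5):
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    start = time.perf_counter()
    for _ in range(repeats):
        _ = a @ b
    return (time.perf_counter() - start) / repeats

def gpu_gemm_time(n, repeats=5):
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    # Warm up once so CUDA context creation is not counted in the timing.
    _ = cp.asarray(a) @ cp.asarray(b)
    cp.cuda.Device().synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        # Count the copy to the GPU: the offload decision depends on it.
        _ = cp.asarray(a) @ cp.asarray(b)
    cp.cuda.Device().synchronize()
    return (time.perf_counter() - start) / repeats

for n in (64, 128, 256, 512, 1024, 2048):
    cpu_t, gpu_t = cpu_gemm_time(n), gpu_gemm_time(n)
    winner = "GPU" if gpu_t < cpu_t else "CPU"
    print(f"n={n:5d}  CPU {cpu_t*1e3:8.2f} ms  GPU {gpu_t*1e3:8.2f} ms  -> {winner} faster")
```

On a real system, the size at which the GPU wins shifts with the CPU and GPU hardware, how they are connected, and the BLAS libraries used, which is exactly the behaviour the paper's benchmark characterises.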

Why is it important?

The majority of supercomputers only have CPUs, but an increasing number now include GPUs too, and this is starting to become the norm. How the CPU and GPU are connected is also changing, with more System-on-Chip (SoC) solutions being used. These changes can drastically reduce the overhead of using a GPU, so utilising one may be worthwhile in more situations than before. Demand for AI is also growing, and given its reliance on GEMM and GEMV, minimising the execution time of these operations goes a long way towards reducing the overall time to solution. However, even for non-AI applications, each problem is unique in terms of how big the matrices are and how many GEMM or GEMV operations are performed. Whilst GPUs are typically good for these workloads, this is not always the case, so a benchmark which can help determine whether a GPU is worth using is of great benefit. Additionally, for older applications that use GEMM and/or GEMV operations, changing the code to run on a GPU rather than just a CPU takes time and effort and is generally more complex. If using a GPU does not provide a performance benefit, then this porting effort would be a waste of time and resources. So, knowing in advance, for a specific supercomputer and GEMM/GEMV library, whether changing your code is worthwhile is greatly beneficial.

Perspectives

Doing this study, it was really interesting to see the drastic performance differences between supercomputers that use hardware and software from different vendors, even though the underlying technology is very similar. The immense performance of NVIDIA's Grace-Hopper superchip SoC (used in one of the supercomputers this study evaluated) was also really impressive to see, with the GPU eclipsing CPU performance in the vast majority of the situations we tested.

Finn Wilkinson
University of Bristol

Read the Original

This page is a summary of: Assessing the GPU Offload Threshold of GEMM and GEMV Kernels on Modern Heterogeneous HPC Systems, November 2024, Institute of Electrical & Electronics Engineers (IEEE), DOI: 10.1109/scw63240.2024.00188.
