What is it about?

As CPUs continue to be a preferred platform for running DLRM inference workloads, improving the performance of these emerging workloads on single- and multi-core CPUs becomes critical. In this context, optimizing the embedding stage, which is the major bottleneck in the DLRM pipeline, is especially important. While prior efforts have proposed several techniques to enhance DLRM performance, primarily through heterogeneous CPU+GPU platforms and near-memory-processing (NMP) accelerators, this paper focuses on improving performance on state-of-the-art CPUs using well-known software techniques. We propose customized software prefetching and computation overlapping via hyperthreading to minimize the embedding-stage overhead and the overall execution latency. Our experimental results show that the proposed techniques improve the performance of embedding-heavy and mixed models across datasets of varying hotness by up to 1.59x. We believe that our techniques can be implemented straightforwardly on CPUs, making them a competitive platform for DLRM inference. Finally, with these two enhancements, which require no change in hardware or in the publicly distributed models, we obtain an average improvement of 40%.
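
For readers who want a concrete picture of the prefetching idea, here is a minimal sketch (not the paper's actual code) of application-guided software prefetching in an embedding pooling loop: while the current row is being accumulated, a prefetch is issued for a row that will be needed a few lookups ahead, so its miss latency is hidden behind useful work. The function name, the flat row-major table layout, and the prefetch distance of 8 are illustrative assumptions.

```cpp
// Sketch: software-prefetched embedding pooling (SparseLengthsSum-style).
// Assumes a flat, row-major float table and one list of row indices per sample.
// __builtin_prefetch is a GCC/Clang builtin for non-binding prefetch hints.
#include <cstddef>
#include <vector>

void pooled_lookup_with_prefetch(const float* table, std::size_t dim,
                                 const std::vector<std::size_t>& indices,
                                 float* out, std::size_t prefetch_distance = 8) {
    for (std::size_t d = 0; d < dim; ++d) out[d] = 0.0f;
    for (std::size_t i = 0; i < indices.size(); ++i) {
        // Prefetch a row needed a few iterations ahead, so its LLC-miss
        // latency overlaps with the accumulation of the current row.
        if (i + prefetch_distance < indices.size()) {
            __builtin_prefetch(table + indices[i + prefetch_distance] * dim,
                               /*rw=*/0, /*locality=*/1);
        }
        const float* row = table + indices[i] * dim;
        for (std::size_t d = 0; d < dim; ++d) out[d] += row[d];
    }
}
```

In practice, wide embedding rows span several cache lines, so one prefetch per line (and a distance tuned to the lookup pattern) would be needed; that tuning is exactly where application knowledge matters.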

Why is it important?

We do not claim novelty in running DLRMs on CPUs. As several prior studies note, CPUs are becoming increasingly competitive to take on these workloads opportunistically in a datacenter with high GPU demand. Prior published studies of DLRMs on CPUs/GPUs used datasets and models that are smaller and exhibit more skewed access patterns. We explore larger models with a wide range of access patterns, which revealed that the (temporal) data locality can be much worse than what prior work assumed. This, in turn, can make some previously proposed techniques less effective. Instead, we show that for these newer datasets it is more important to either (i) reduce the latency of off-chip accesses caused by LLC misses (using prefetching), (ii) tolerate the latency of off-chip accesses (using multithreading), or (iii) possibly both.

Similarly, prefetching and hyperthreading are themselves not new, and we do not claim to be the first to use them for DLRMs; unpublished industrial production systems may already be using them to some extent. However, to our knowledge, no published research has systematically studied how and when to prefetch or multithread intelligently for DLRM workloads. With respect to prefetching, we show that simply employing a state-of-the-art hardware or software prefetcher does not work very well. Instead, leveraging application knowledge of when and how the embedding tables are accessed is key to initiating and fine-tuning prefetches. As for hyperthreading, some prior works report it to be detrimental to DLRMs. In contrast, we show that DLRMs can benefit from hyperthreading if new threads are spawned and grouped smartly (based on application knowledge) for maximum effectiveness, rather than letting the underlying system do the job; we have not seen this in any prior study. Finally, the two proposals complement each other, producing benefits greater than the sum of the parts, which is again a new observation: prefetching helps free CPU pipeline resources and avoids issues like full-window stalls, thereby letting the sibling thread use more of those resources.
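
To make the thread-grouping idea concrete, here is a minimal Linux-specific sketch (again, not the paper's implementation) that pins two cooperating threads to the two hyperthreads of the same physical core, so that an embedding-lookup thread and a compute thread overlap on shared core resources instead of being placed arbitrarily by the OS scheduler. The logical CPU numbers 0 and 16 and the empty worker bodies are placeholders; the actual sibling mapping is machine-specific (see /sys/devices/system/cpu/cpu*/topology/thread_siblings_list).

```cpp
// Sketch: explicitly co-locating two worker threads on sibling hyperthreads.
// Build with: g++ -O2 -pthread pin_siblings.cpp   (file name is hypothetical)
#include <pthread.h>
#include <sched.h>
#include <thread>

// Pin a std::thread to one logical CPU using the Linux affinity API.
void pin_to_logical_cpu(std::thread& t, int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(t.native_handle(), sizeof(cpu_set_t), &set);
}

int main() {
    std::thread embedding_worker([] { /* embedding-table lookups */ });
    std::thread compute_worker([] { /* MLP / feature-interaction compute */ });
    // Hypothetical sibling pair: logical CPUs 0 and 16 share a physical core.
    pin_to_logical_cpu(embedding_worker, 0);
    pin_to_logical_cpu(compute_worker, 16);
    embedding_worker.join();
    compute_worker.join();
    return 0;
}
```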

Perspectives

We have open-sourced the codebase and DLRM datasets at: https://github.com/rishucoding/reproduce_isca23_cpu_DLRM_inference. This should help readers and enthusiasts replicate our work and build on it in future research.

Rishabh Jain
Pennsylvania State University

Read the Original

This page is a summary of: Optimizing CPU Performance for Recommendation Systems At-Scale, June 2023, ACM (Association for Computing Machinery),
DOI: 10.1145/3579371.3589112.
