All Stories

  1. Thorough Characterization and Analysis of Large Transformer Model Training At-Scale
  2. Minimizing Coherence Errors via Dynamic Decoupling
  3. Impact of Write-Allocate Elimination on Fujitsu A64FX
  4. MBFGraph: An SSD-based External Graph System for Evolving Graphs
  5. Hardware Support for Constant-Time Programming
  6. Quantifying and Mitigating Cache Side Channel Leakage with Differential Set
  7. Optimizing CPU Performance for Recommendation Systems At-Scale
  8. EdgePC: Efficient Deep Learning Analytics for Point Clouds on Edge Devices
  9. Cypress
  10. Multi-resource fair allocation for consolidated flash-based caching systems
  11. Fine-Granular Computation and Data Layout Reorganization for Improving Locality
  12. An architecture interface and offload model for low-overhead, near-data, distributed accelerators
  13. Skipper: Enabling efficient SNN training through activation-checkpointing and time-skipping
  14. Pushing Point Cloud Compression to the Edge
  15. End-to-end Characterization of Game Streaming Applications on Mobile Platforms
  16. Memory Space Recycling
  17. Data Convection
  18. A Scheduling Framework for Decomposable Kernels on Energy Harvesting IoT Edge Nodes
  19. Kraken
  20. SpecSafe: detecting cache side channels in a speculative world
  21. Mix and Match: Reorganizing Tasks for Enhancing Data Locality
  22. Distance-in-time versus distance-in-space
  23. Fluid: a framework for approximate concurrency via controlled dependency relaxation
  24. Ghost Thread
  25. SplitServe
  26. Fifer
  27. Implications of Public Cloud Resource Heterogeneity for Inference Serving
  28. Minimal Variance Sampling with Provable Guarantees for Fast Training of Graph Neural Networks
  29. Déjà View: Spatio-Temporal Compute Reuse for Energy-Efficient 360° VR Video Streaming
  30. Computing with Near Data
  31. Quantifying Data Locality in Dynamic Parallelism in GPUs
  32. Architecture-Aware Approximate Computing
  33. Co-optimizing memory-level parallelism and cache-level parallelism
  34. NEOFog
  35. ReveNAND
  36. Enhancing computation-to-core assignment with physical location information
  37. Exploiting Data Longevity for Enhancing the Lifetime of Flash-based Storage Class Memory
  38. A Study on Performance and Power Efficiency of Dense Non-Volatile Caches in Multi-Core Systems
  39. Hardware-Software Co-design to Mitigate DRAM Refresh Overheads
  40. Exploiting Intra-Request Slack to Improve SSD Performance
  41. VIP
  42. A case for core-assisted bottleneck acceleration in GPUs
  43. Anatomy of GPU Memory System for Multi-Application Execution
  44. Optimizing off-chip accesses in multicores
  45. EECache
  46. Memory Row Reuse Distance and its Role in Optimizing Application Performance
  47. Network footprint reduction through data access and computation placement in NoC-based manycores
  48. TaPEr
  49. Volatile STT-RAM Scratchpad Design and Data Allocation for Low Energy
  50. Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications
  51. Trading cache hit rate for memory performance
  52. Orchestrated scheduling and prefetching for GPGPUs
  53. Performance enhancement under power constraints using heterogeneous CMOS-TFET multicores
  54. Physically addressed queueing (PAQ)
  55. A compiler framework for extracting superword level parallelism