All Stories

  1. SmartGraph: A Framework for Graph Processing in Computational Storage
  2. FAAStloop: Optimizing Loop-Based Applications for Serverless Computing
  3. Thorough Characterization and Analysis of Large Transformer Model Training At-Scale
  4. Minimizing Coherence Errors via Dynamic Decoupling
  5. Impact of Write-Allocate Elimination on Fujitsu A64FX
  6. MBFGraph: An SSD-based External Graph System for Evolving Graphs
  7. Hardware Support for Constant-Time Programming
  8. Quantifying and Mitigating Cache Side Channel Leakage with Differential Set
  9. Optimizing CPU Performance for Recommendation Systems At-Scale
  10. EdgePC: Efficient Deep Learning Analytics for Point Clouds on Edge Devices
  11. Cypress
  12. Multi-resource fair allocation for consolidated flash-based caching systems
  13. Fine-Granular Computation and Data Layout Reorganization for Improving Locality
  14. An architecture interface and offload model for low-overhead, near-data, distributed accelerators
  15. Skipper: Enabling efficient SNN training through activation-checkpointing and time-skipping
  16. Pushing Point Cloud Compression to the Edge
  17. End-to-end Characterization of Game Streaming Applications on Mobile Platforms
  18. Memory Space Recycling
  19. Data Convection
  20. A Scheduling Framework for Decomposable Kernels on Energy Harvesting IoT Edge Nodes
  21. Kraken
  22. SpecSafe: detecting cache side channels in a speculative world
  23. Mix and Match: Reorganizing Tasks for Enhancing Data Locality
  24. Distance-in-time versus distance-in-space
  25. Fluid: a framework for approximate concurrency via controlled dependency relaxation
  26. Ghost Thread
  27. SplitServe
  28. Fifer
  29. Implications of Public Cloud Resource Heterogeneity for Inference Serving
  30. Minimal Variance Sampling with Provable Guarantees for Fast Training of Graph Neural Networks
  31. Déjà View: Spatio-Temporal Compute Reuse for Energy-Efficient 360° VR Video Streaming
  32. Computing with Near Data
  33. Quantifying Data Locality in Dynamic Parallelism in GPUs
  34. Architecture-Aware Approximate Computing
  35. Co-optimizing memory-level parallelism and cache-level parallelism
  36. NEOFog
  37. ReveNAND
  38. Enhancing computation-to-core assignment with physical location information
  39. Exploiting Data Longevity for Enhancing the Lifetime of Flash-based Storage Class Memory
  40. A Study on Performance and Power Efficiency of Dense Non-Volatile Caches in Multi-Core Systems
  41. Hardware-Software Co-design to Mitigate DRAM Refresh Overheads
  42. Exploiting Intra-Request Slack to Improve SSD Performance
  43. VIP
  44. A case for core-assisted bottleneck acceleration in GPUs
  45. Anatomy of GPU Memory System for Multi-Application Execution
  46. Optimizing off-chip accesses in multicores
  47. EECache
  48. Memory Row Reuse Distance and its Role in Optimizing Application Performance
  49. Network footprint reduction through data access and computation placement in NoC-based manycores
  50. TaPEr
  51. Volatile STT-RAM Scratchpad Design and Data Allocation for Low Energy
  52. Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications
  53. Trading cache hit rate for memory performance
  54. Orchestrated scheduling and prefetching for GPGPUs
  55. Performance enhancement under power constraints using heterogeneous CMOS-TFET multicores
  56. Physically addressed queueing (PAQ)
  57. A compiler framework for extracting superword level parallelism