All Stories

  1. A Survey on Failure Analysis and Fault Injection in AI Systems
  2. L4: Diagnosing Large-scale LLM Training Failures via Automated Log Analysis
  3. COCA: Generative Root Cause Analysis for Distributed Systems with Code Knowledge
  4. Mint: Cost-Efficient Tracing with All Requests Collection via Commonality and Variability Analysis
  5. FaaSConf: QoS-aware Hybrid Resources Configuration for Serverless Workflows
  6. CTuner: Automatic NoSQL Database Tuning with Causal Reinforcement Learning
  7. ChangeRCA: Finding Root Causes from Software Changes in Large Online Systems
  8. TraStrainer: Adaptive Sampling for Distributed Traces with System Runtime State
  9. Nezha: Interpretable Fine-Grained Root Causes Analysis for Microservices on Multi-modal Observability Data
  10. DiagConfig: Configuration Diagnosis of Performance Violations in Configurable Software Systems
  11. MARS: Fault Localization in Programmable Networking Systems with Low-cost In-Band Network Telemetry
  12. DeepPower: Deep Reinforcement Learning based Power Management for Latency Critical Applications in Multi-core Systems
  13. LogReducer: Identify and Reduce Log Hotspots in Kernel on the Fly
  14. Fighting against Incidents in Large-Scale Online Systems
  15. MicroRank: End-to-End Latency Issue Localization with Extended Spectrum Analysis in Microservice Environments