What is it about?
As online systems grow in scale and complexity, incidents are becoming commonplace. Without appropriate handling, they can seriously harm system availability. In large-scale online systems, however, incidents are usually buried in a flood of issues (that is, abnormal events that are not necessarily incidents), which makes them difficult to handle. These issues typically cascade across the system, and managing incidents properly depends heavily on a thorough analysis of this effect. We therefore design a method that automatically analyzes the cascading effect of availability issues in online systems and extracts corresponding graph-based issue representations that incorporate both the issue symptoms and the attributes of the affected services. The extracted representations facilitate incident detection and diagnosis. Our approach has been successfully deployed in WeChat, the largest instant messaging system in China, and greatly eases the burden on operators fighting incidents in practice.
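To make the idea of a graph-based issue representation more concrete, here is a minimal Python sketch. It is not the method from the paper: it simply assumes that services form a call graph and that abnormal services connected by call edges belong to the same issue. The function name `extract_issue_subgraphs`, the service names, and the symptom strings are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict, deque


def extract_issue_subgraphs(call_edges, abnormal):
    """Group abnormal services linked by call edges into issue subgraphs.

    call_edges: list of (caller, callee) pairs (hypothetical call graph).
    abnormal:   dict mapping service name -> list of observed symptoms.
    Assumption for illustration: connected abnormal services form one issue.
    """
    # Undirected adjacency restricted to abnormal services.
    adj = defaultdict(set)
    for caller, callee in call_edges:
        if caller in abnormal and callee in abnormal:
            adj[caller].add(callee)
            adj[callee].add(caller)

    seen, issues = set(), []
    for start in sorted(abnormal):
        if start in seen:
            continue
        # Breadth-first search collects one connected group of abnormal services.
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in sorted(adj[node]):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        services = sorted(component)
        issues.append({
            "services": services,
            # Keep the original call direction for edges inside the issue.
            "edges": [(u, v) for u, v in call_edges
                      if u in component and v in component],
            # Attach each service's symptoms to the representation.
            "symptoms": {s: abnormal[s] for s in services},
        })
    return issues


# Hypothetical example data.
call_edges = [("gateway", "login"), ("login", "db"),
              ("gateway", "feed"), ("feed", "cache")]
abnormal = {"login": ["high latency"], "db": ["timeouts"],
            "cache": ["error rate spike"]}

issues = extract_issue_subgraphs(call_edges, abnormal)
for issue in issues:
    print(issue["services"], issue["edges"])
```

In this toy data, the abnormal `login` and `db` services are joined by a call edge and end up in one issue, while `cache` forms a separate issue because its caller is healthy. A real system would need richer symptom and service attributes than this sketch carries.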
Why is it important?
Without timely and appropriate management, incidents can quickly cause significant economic loss and seriously degrade the user experience. For companies that run large-scale online systems, understanding and using the information relevant to incident detection and diagnosis is crucial.
Read the Original
This page is a summary of: Graph based Incident Extraction and Diagnosis in Large-Scale Online Systems, October 2022, ACM (Association for Computing Machinery),
DOI: 10.1145/3551349.3556904.