Graph based Incident Extraction and Diagnosis in Large-Scale Online Systems

Zilong He; Pengfei Chen; Yu Luo; Qiuyu Yan; Hongyang Chen; Guangba Yu; Fangyuan Li

doi:10.1145/3551349.3556904

What is it about?

With the ever-increasing scale and complexity of online systems, incidents are becoming commonplace. Without appropriate handling, they can seriously harm system availability. In large-scale online systems, however, incidents are usually buried among a flood of issues (anomalies that are not necessarily incidents), which makes them difficult to handle. These issues typically cascade across the system, and managing incidents properly depends heavily on a thorough analysis of that cascading effect. We therefore design a method that automatically analyzes the cascading effect of availability issues in online systems and extracts graph-based issue representations that incorporate both the issue symptoms and the attributes of the affected services. The extracted representations facilitate incident detection and diagnosis. Our approach has been deployed in WeChat, the largest instant messaging system in China, where it greatly eases the burden on operators who handle incidents in practice.
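As a rough illustration of what a graph-based issue representation could look like, the sketch below groups abnormal services that call one another into one issue graph and attaches symptoms and service attributes to each node. It is not the method from the paper: the function name `extract_issue_graphs`, the service names, the symptom labels, and the attribute fields are all hypothetical, and plain connected-component grouping stands in for the paper's cascading-effect analysis.

```python
from collections import defaultdict, deque


def extract_issue_graphs(call_edges, anomalies, attributes):
    """Group abnormal services linked by calls into issue graphs.

    call_edges: list of (caller, callee) pairs (hypothetical call graph)
    anomalies:  dict mapping an abnormal service to its symptom list
    attributes: dict mapping a service to its attribute dict
    """
    # Keep only edges whose two endpoints are both abnormal, treated as
    # undirected so that a cascade is grouped regardless of call direction.
    neighbors = defaultdict(set)
    for caller, callee in call_edges:
        if caller in anomalies and callee in anomalies:
            neighbors[caller].add(callee)
            neighbors[callee].add(caller)

    seen = set()
    issues = []
    for start in sorted(anomalies):
        if start in seen:
            continue
        # Breadth-first search collects one connected group of abnormal services.
        component = []
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.append(node)
            for nxt in sorted(neighbors[node]):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)

        members = set(component)
        issues.append({
            # Each node carries both its symptoms and its service attributes.
            "nodes": {
                s: {"symptoms": anomalies[s], "attributes": attributes.get(s, {})}
                for s in sorted(component)
            },
            # Original directed call edges that fall inside this issue.
            "edges": sorted(
                (a, b) for a, b in call_edges if a in members and b in members
            ),
        })
    return issues


# Hypothetical example data, invented for illustration only.
call_edges = [
    ("gateway", "login"),
    ("login", "userdb"),
    ("gateway", "pay"),
    ("pay", "ledger"),
]
anomalies = {
    "login": ["timeout"],
    "userdb": ["high_latency"],
    "ledger": ["error_rate"],
}
attributes = {"login": {"tier": "frontend"}, "userdb": {"tier": "storage"}}

issues = extract_issue_graphs(call_edges, anomalies, attributes)
for issue in issues:
    print(sorted(issue["nodes"]), issue["edges"])
```

With this example data, `login` and `userdb` end up in one issue graph because they are connected by a call, while `ledger` forms a separate issue on its own. A detector or diagnosis step could then work on each issue graph as a unit instead of on individual alerts.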


Why is it important?

Without timely and appropriate management, incidents can quickly cause significant economic loss and seriously degrade the user experience. For companies that operate large-scale online systems, identifying and using the information relevant to incident detection and diagnosis is therefore crucial.

This page is a summary of: Graph based Incident Extraction and Diagnosis in Large-Scale Online Systems, October 2022, ACM (Association for Computing Machinery),
DOI: 10.1145/3551349.3556904.
