Using GPT-3 for root cause summarization of incidents

submitted by
Style Pass
2021-09-09 18:30:02

The complexity of today’s distributed microservices applications makes it tough to track down the root cause when a problem occurs. The time-tested method of drilling down on monitoring dashboards and then digging into logs simply takes too long. Hunting through huge volumes of logs is tedious, and interpreting them is difficult: it takes considerable skill and experience to understand what the logs mean and to identify the factors that relate to the root cause. Worse, this approach ties up the most critical engineering and DevOps staff, keeping them from work that could be more valuable to the business.

It’s no wonder machine learning (ML) applied to logs is gaining momentum. When an application problem occurs, the patterns in the logs change in a noticeable way. With the right approach, ML can find these anomalous patterns and distill them into a short sequence of log lines that explains the root cause. Imagine the time saved by reviewing only 20 log lines curated by ML instead of hunting through the many millions of log lines generated while the problem took place. Using ML on logs changes the troubleshooting process, speeding up incident resolution and freeing key engineers to work on new features instead of fighting fires.
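To make the idea of "anomalous patterns" concrete, here is a minimal sketch in Python. It is not the ML approach the post refers to; it swaps in a much simpler frequency heuristic as an illustration: mask the variable parts of each line so similar lines share one pattern, then keep the incident lines whose pattern is rare or absent in a known-good baseline. The function names, the masking rules, and the `max_freq` threshold are all assumptions made for this example.

```python
import re
from collections import Counter


def template(line: str) -> str:
    """Collapse variable parts (hex values, numbers) so similar lines share a pattern."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)
    line = re.sub(r"\d+", "<NUM>", line)
    return line.strip()


def anomalous_lines(baseline, incident, max_freq=0.01, max_lines=20):
    """Return up to max_lines incident lines whose pattern is rare in the baseline.

    max_freq is a hypothetical cutoff: a pattern counts as anomalous when it
    accounts for at most that fraction of the baseline lines.
    """
    base_counts = Counter(template(l) for l in baseline)
    total = max(len(baseline), 1)

    seen = set()
    scored = []
    for pos, line in enumerate(incident):
        pat = template(line)
        if pat in seen:
            continue  # report each pattern once
        seen.add(pat)
        freq = base_counts[pat] / total
        if freq <= max_freq:
            scored.append((freq, pos, line))

    scored.sort()  # rarest patterns first, ties broken by position
    picked = sorted(scored[:max_lines], key=lambda item: item[1])
    return [line for _, _, line in picked]  # back in original log order


baseline = ["GET /health 200 in 3 ms"] * 50 + ["user 17 logged in"] * 50
incident = [
    "GET /health 200 in 4 ms",
    "connection to db 10.0.0.5 refused",
    "user 42 logged in",
    "OOM killer terminated pid 991",
]
result = anomalous_lines(baseline, incident)
```

In this toy run, the health check and login lines match patterns that dominate the baseline, so only the two unfamiliar lines survive. A real system would need far more robust pattern extraction and scoring than this.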

While ML transforms the process of hunting through logs, it does not fully solve the challenge for all users. Even with the best machine learning techniques, there is a last-mile problem: interpreting the log lines normally requires a skilled human who knows the part of the application or infrastructure that failed. Think of the possibilities if AI could interpret those same log lines and remove that reliance on key engineering resources.
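One way to picture handing those curated lines to a language model is as a plain-text prompt. The sketch below only builds such a prompt; it makes no API call, and the wording, the function name `build_summary_prompt`, and the 20-line cap are assumptions for illustration, not the setup described in this post.

```python
def build_summary_prompt(log_lines, max_lines=20):
    """Format curated log lines into a plain-text prompt for a language model."""
    # Drop blank lines and cap the list so the prompt stays short.
    selected = [line.strip() for line in log_lines if line.strip()][:max_lines]
    numbered = "\n".join(f"{i}. {line}" for i, line in enumerate(selected, start=1))
    return (
        "These log lines were flagged as anomalous during an incident:\n\n"
        f"{numbered}\n\n"
        "In one or two plain-English sentences, describe the most likely root cause:"
    )


prompt = build_summary_prompt([
    "connection to db 10.0.0.5 refused",
    "",
    "OOM killer terminated pid 991",
])
```

The resulting string could then be sent to whichever text-completion service is in use. The quality of any summary would depend heavily on the model and on how well the log lines were curated.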
