Find and fix IT glitches before they crash the system

IBM infuses its AIOps Insights platform with generative AI on watsonx for faster, more accurate remediation of IT incidents.

IT failures can be costly. Even short outages can add up to millions of dollars in lost business, and the price tag for these failures is growing as more customers handle their business online. Fortunately, AI is helping IT teams stay ahead of these potentially catastrophic disruptions.

For years, IBM’s IT Automation portfolio has allowed IT experts to identify incidents early, before they escalate. IBM Research is now enhancing the latest version of its AIOps Insights platform with a new set of intelligent remediation capabilities that make finding the most accurate solution to an IT issue faster and easier.

IBM AIOps Insights will soon harness the power of large language models (LLMs) and generative AI through IBM’s next-generation AI platform watsonx. It will summarize the incident, identify a probable cause, and guide teams through a set of AI-recommended remediation steps. This end-to-end solution has the potential to dramatically improve response times and outcomes.

Today, IBM AIOps Insights gathers data from the client’s IT environment and looks for correlations in the data to identify potential issues. If an incident is uncovered, the IT operations expert on duty — let’s call her Katie in the following example — is notified.

IBM AIOps Insights has just detected that 5% of users on Katie’s enterprise application are experiencing a slowdown that could jeopardize a mission-critical workload. Fault localization identifies the data storage application, Redis, as the source of the bottleneck. IBM AIOps Insights points Katie to Redis and helps her find the probable cause of the incident.

Diagnosing the cause of the incident

An intelligent remediation module in IBM’s new version of AIOps Insights speeds up the search, guiding Katie to the cause of the slowdown and helping her fix it quickly, before mission-critical workloads are interrupted.

Using AI, reinforcement learning, and causal analysis, AIOps Insights takes streams of monitoring data and comes up with hypotheses for what may have gone wrong with Redis. AIOps Insights then probes for more information and analyzes it, ruling out hypotheses one by one until it identifies the most likely cause of the slowdown: high CPU usage on Redis, a condition known as CPU cache thrashing.
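To make the ruling-out step concrete, here is a minimal Python sketch of hypothesis elimination. It is not IBM's implementation: the hypothesis names, metric names, and 90% thresholds are all assumptions made up for illustration, and simple threshold checks stand in for the reinforcement learning and causal analysis described above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass
class Hypothesis:
    name: str
    # Returns True when the collected metrics are consistent with this cause.
    is_consistent: Callable[[Metrics], bool]


def surviving_hypotheses(hypotheses: List[Hypothesis], metrics: Metrics) -> List[str]:
    """Keep only the hypotheses that the monitoring data does not rule out."""
    return [h.name for h in hypotheses if h.is_consistent(metrics)]


# Hypothetical candidate causes with made-up thresholds.
HYPOTHESES = [
    Hypothesis("memory exhaustion", lambda m: m["memory_used_pct"] > 90),
    Hypothesis("network saturation", lambda m: m["network_used_pct"] > 90),
    Hypothesis("high CPU usage", lambda m: m["cpu_used_pct"] > 90),
]

# Hypothetical metrics gathered while probing the Redis node.
metrics = {"memory_used_pct": 41.0, "network_used_pct": 35.0, "cpu_used_pct": 97.0}

print(surviving_hypotheses(HYPOTHESES, metrics))  # ['high CPU usage']
```

With these sample numbers, only the CPU hypothesis survives, which mirrors the diagnosis in Katie's example.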

From this validated diagnosis, AIOps Insights will soon draw, in real time, on a 13-billion-parameter LLM from IBM's Granite family, trained on watsonx by IBM Research. The model will provide a summary of the incident along with a list of observations supporting the diagnosis, which Katie can verify if needed.

A recommended course of action

With the help of AI, Katie has now traced the probable cause of the slowdown to the CPU. Her next step is to figure out how to restore the system. This is where retrieval-augmented generation (RAG), an AI framework for retrieving facts from a live knowledge base, comes in.

Here, AIOps Insights leverages RAG to find and generate the most accurate, up-to-date recommendations for fixing IT issues. AIOps Insights calls on the watsonx LLM to pull information from an online database of tips for troubleshooting issues tied to containerized workloads on platforms like Kubernetes or Docker.
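The general shape of RAG can be sketched in a few lines of Python: retrieve the most relevant entries from a knowledge base, then place them in the prompt sent to the model. Everything below is an assumption for illustration. The knowledge-base entries are invented, word overlap stands in for whatever retrieval method AIOps Insights actually uses, and no call to watsonx is made; the sketch stops at building the prompt.

```python
import re
from typing import List, Set

# Hypothetical troubleshooting tips standing in for the online knowledge base.
KNOWLEDGE_BASE = [
    "If a node shows sustained high CPU usage, cordon the node and reschedule its pods.",
    "If a pod is stuck in ImagePullBackOff, check the image name and registry credentials.",
    "If a persistent volume claim stays pending, check the storage class and available capacity.",
]


def tokens(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, docs: List[str], k: int = 1) -> List[str]:
    """Return the k documents that share the most words with the query."""
    query_tokens = tokens(query)
    ranked = sorted(docs, key=lambda d: len(query_tokens & tokens(d)), reverse=True)
    return ranked[:k]


def build_prompt(diagnosis: str, docs: List[str]) -> str:
    """Combine the diagnosis and the retrieved tips into a single prompt."""
    context = "\n".join(f"- {d}" for d in docs)
    return (
        f"Incident diagnosis: {diagnosis}\n"
        f"Relevant tips:\n{context}\n"
        "Recommend remediation steps."
    )


diagnosis = "high CPU usage on the Redis node"
top = retrieve(diagnosis, KNOWLEDGE_BASE)
prompt = build_prompt(diagnosis, top)
print(prompt)
```

Because the model answers from retrieved text rather than from memory alone, updating the knowledge base changes the recommendations without retraining the model.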

The knowledge base is used at run time to provide Katie with a quick action recommendation and an executable script: cordon the node, scale the replica-set, delete the Redis pods on the faulty node so that the Redis statefulset will create new pods, and scale the Redis statefulset back down to its original number of replicas.
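As a rough illustration of what such a script might contain, the Python sketch below assembles the four steps as kubectl commands without running them. This is not the script AIOps Insights generates. The node, statefulset, namespace, and pod names are hypothetical, and reading "scale the replica-set" as temporarily adding one replica to the statefulset is an assumption.

```python
from typing import List


def remediation_commands(
    node: str,
    statefulset: str,
    namespace: str,
    pods_on_node: List[str],
    original_replicas: int,
    extra_replicas: int = 1,  # assumption: scale up by one replica during the fix
) -> List[str]:
    """Build the remediation steps as kubectl commands, in order, without executing them."""
    scaled_up = original_replicas + extra_replicas
    commands = [
        # 1. Stop new pods from being scheduled onto the faulty node.
        f"kubectl cordon {node}",
        # 2. Temporarily add capacity.
        f"kubectl scale statefulset {statefulset} --replicas={scaled_up} -n {namespace}",
    ]
    # 3. Delete the pods on the faulty node so the statefulset recreates them elsewhere.
    commands += [f"kubectl delete pod {pod} -n {namespace}" for pod in pods_on_node]
    # 4. Return to the original replica count.
    commands.append(
        f"kubectl scale statefulset {statefulset} --replicas={original_replicas} -n {namespace}"
    )
    return commands


# Hypothetical names for illustration only.
for command in remediation_commands("worker-2", "redis", "prod", ["redis-1"], original_replicas=3):
    print(command)
```

A real script would also wait for the new pods to become ready between steps. Keeping the commands as plain strings makes them easy for someone like Katie to review before anything runs.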

Normally, Katie would write her own script to carry out the suggested remediation. However, the action recommendation module makes it possible to complete the job faster and with greater accuracy by putting the most relevant knowledge at her fingertips.

Katie validates, compiles, and executes the recommended script. Within seconds, the system returns to normal. An outage, or more serious downstream crisis, has been averted. The results of this incident can be added to a runbook that, once approved, can be reused if a similar issue should arise.

Through intelligent remediation with watsonx, IBM AIOps Insights can now help IT experts identify the most likely cause of an IT glitch and address it before it turns into a long and costly disruption.