AI‑Driven Incident Resolution with Watson AIOps

Key Points

  • Faster, more frequent cloud deployments boost delivery speed but also increase incident volume and resolution time, straining IT operations and potentially upsetting customers.
  • Incident resolution is measured by metrics such as Mean Time to Resolution (MTTR), Mean Time to Fix (MTTF), and especially Mean Time to Identify (MTTI), which can vary widely depending on operator knowledge and system complexity.
  • IBM Cloud Pak for Watson AIOps leverages machine‑learning‑driven anomaly detection and unsupervised learning on multi‑source data (logs, PagerDuty, Splunk, ServiceNow, etc.) to automatically surface likely causes and cut MTTI without extensive model training.
  • The platform provides out‑of‑the‑box, pre‑trained models and intelligent search of prior run‑books to quickly resolve “lucky‑day” incidents, while for more complex “not‑so‑lucky” cases it auto‑generates incident summaries, hypothesizes impacted services, and groups change‑related events to speed root‑cause analysis.
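
To make the three metrics above concrete, here is a minimal Python sketch that computes them from incident timestamps. It assumes, purely for illustration, that each incident has three recorded times (start, cause identified, resolved) and that time to fix runs from identification to resolution. The `Incident` class, the field names, and the sample data are hypothetical and are not part of Cloud Pak for Watson AIOps.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    started: datetime     # when the problem began
    identified: datetime  # when the likely cause was identified
    resolved: datetime    # when the incident was finally resolved


def mean_hours(deltas):
    """Average a list of timedeltas and return the result in hours."""
    return mean(d.total_seconds() for d in deltas) / 3600


def resolution_metrics(incidents):
    """Return (MTTI, MTTF, MTTR) in hours, under the assumptions above."""
    mtti = mean_hours([i.identified - i.started for i in incidents])
    mttf = mean_hours([i.resolved - i.identified for i in incidents])
    mttr = mean_hours([i.resolved - i.started for i in incidents])
    return mtti, mttf, mttr


# Made-up sample data: same one-hour fix, very different identification times.
t0 = datetime(2024, 1, 1, 9, 0)
incidents = [
    Incident(t0, t0 + timedelta(hours=2), t0 + timedelta(hours=3)),
    Incident(t0, t0 + timedelta(hours=10), t0 + timedelta(hours=11)),
]
mtti, mttf, mttr = resolution_metrics(incidents)
print(f"MTTI={mtti:.1f}h MTTF={mttf:.1f}h MTTR={mttr:.1f}h")
```

In this toy data the fix time is constant while identification time dominates the total, which mirrors the point that MTTI is the most variable part of resolution.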

Full Transcript

**Source:** [https://www.youtube.com/watch?v=ph8p-eP9Y90](https://www.youtube.com/watch?v=ph8p-eP9Y90)

**Duration:** 00:07:43

## Sections

- [00:00:00](https://www.youtube.com/watch?v=ph8p-eP9Y90&t=0s) **AI-Driven Incident Resolution Acceleration** - The speaker highlights that faster cloud deployments lengthen incident resolution times for ops teams, then demonstrates how IBM Cloud Pak for Watson AIOps leverages automation and machine‑learning‑based anomaly detection to shorten mean time to identify (MTTI) and overall mean time to resolution.
- [00:03:06](https://www.youtube.com/watch?v=ph8p-eP9Y90&t=186s) **AI-Driven Incident Resolution** - The passage explains how Cloud Pak for Watson AIOps uses smart topology, AI, and NLP to summarize diagnostics, consolidate siloed data, surface similar tickets, and guide run‑book execution to resolve incidents efficiently.
- [00:06:11](https://www.youtube.com/watch?v=ph8p-eP9Y90&t=371s) **Proactive Root Cause Analysis with AIOps** - The segment explains how correlating metrics, such as a spiking disk‑busy rate, helps pinpoint a database overload as the app’s slowdown cause, and how Cloud Pak for Watson AIOps automates this detection and resolution to boost operational efficiency.

## Full Transcript

[0:00] First the good news: cloud development strategies mean more and faster deployments. Now the bad news: more and faster deployments can impact your IT operations team, increasing the time to resolve incidents. That potentially means unhappy customers and more resources dedicated to keeping your systems running smoothly.

[0:19] Hi, I'm Dan Kehn from IBM Cloud®. So what's your ops team to do? I'll cover that question in two quick demonstrations, but first let's briefly review the phases of incident resolution, then I'll explain how automation and AI can help.

[0:36] Mean time to resolution is the big picture. It covers everything from the time the problem started until it's finally resolved. The longer it takes to resolve, the worse the impact to your organization. Some parts of resolution time are consistent, like mean time to fix, or MTTF. Others vary significantly, like mean time to identify, or MTTI, which can run from hours to days. That's because it relies on operators' experience and knowledge of the system relationships.

[1:05] To help lower MTTI, IBM Cloud Pak® for Watson AIOps comes with built-in anomaly identification strategies that use machine learning. Of course, machine learning works best when there's lots of varied, high-quality data. That's why Cloud Pak for Watson AIOps consumes data from many sources. It then uses AI to discover relationships across these different data sources and weigh possible causes. This unsupervised learning reduces the time needed to realize the value of AI, so instead of requiring extensive training, you can get started right away, with out-of-the-box, pre-trained models.

[1:41] Okay, with that intro out of the way, I'd like to walk you through two incidents and how Cloud Pak for Watson AIOps can help you resolve them more quickly. The first I call lucky day.
With the smart search of prior solutions, you close the incident using documented steps in a run book. The second, which unfortunately happens more often than we'd like to admit, I call not-so-lucky day. This is an undiscovered problem that requires sifting through misbehaving servers and confirming the root cause.

[2:10] Imagine you're at lunch and a notification from Cloud Pak for Watson AIOps comes in. You double click to check it out. In the problem summary, you recognize one of the services you monitor, so you decide to investigate. Cloud Pak for Watson AIOps shows you a summary view based on data gathered directly from your app monitoring logs and from integrated tools like PagerDuty, Splunk, and ServiceNow.

[2:33] The chat entry shows you several different fields. First, the impacted application, the train ticket application. Next, a hypothesized source of the problem, the ticket info service. It also shows you incident severity and status. Finally, you can see a summary ticket that was automatically generated by Watson AIOps.

[2:54] Two key questions for resolving a problem are: what changed, and what happened nearby? Cloud Pak for Watson AIOps groups events that represent change and shows a topology map that represents nearby connected services. That is, what changed, how recently it changed, and how frequently it changed give you hints about the source of the problem. With smart topology and an understanding of the context, you now know where to start.

[3:18] Okay, I've shown you how Cloud Pak for Watson AIOps provides a summary of key diagnostic information, but it also helps consolidate information from multiple tools across different data silos. This view shows a summary of the anomalies that underlie the problem report. This helps reduce information overload and avoid notification flooding.
It also saves you from the hassle of chasing problems across different tools.

[3:43] Now that you have a better understanding of the incident, you take action to resolve it. Cloud Pak for Watson AIOps has identified similar tickets based on data interpreted with natural language processing and pre-trained AI models. This can help you quickly identify relevant tickets with possible solutions. By pinpointing specific actions that your team has taken in the past, you don't have to deal with the tedium of manually reviewing a list of prior tickets. You confirm the run book matches the current problem and resolve the incident. Excellent.

[4:14] In the prior investigation, we got lucky. The problem had been resolved once before, so we only had to execute a run book. But what if it wasn't so easy? And that's where AI and machine learning really shine. Cloud Pak for Watson AIOps consumes huge volumes of your system data: structured data like configuration topology, semi-structured data like logs and ticket information, even unstructured data like commit comments. Based on this data, it learns what normal looks like so it can alert you when metrics are outside expected bounds.

[4:46] But the metric manager in Cloud Pak for Watson AIOps doesn't rely on fixed threshold tracking. This avoids the trap where a high fixed threshold generates too few alerts and real problems are ignored until they become serious, or a low threshold generates too many alerts and your operators simply tune them out. Instead, Cloud Pak for Watson AIOps uses machine learning to understand what's normal behavior for key performance metrics and automatically sets adaptive thresholds based on actual system experience.

[5:13] Now let's look at how Cloud Pak for Watson AIOps helps you with your not-so-lucky day. It's later in the week and you're handling another incident.
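
As an editorial aside, the contrast between fixed and adaptive thresholds described above can be sketched in a few lines of Python. This is not the metric manager's actual algorithm, which the video does not detail. It is a simple stand-in that assumes "normal" can be approximated by the mean and standard deviation of a short rolling window. The function name, `window`, `k`, and the sample series are all hypothetical.

```python
from collections import deque
from statistics import mean, stdev


def adaptive_alerts(values, window=5, k=3.0):
    """Yield (index, value) for points that fall outside
    mean +/- k * stdev of the preceding `window` points (window >= 2).

    The bounds are recomputed at every step, so they follow the
    metric's own recent behavior instead of a fixed limit.
    """
    history = deque(maxlen=window)
    for i, v in enumerate(values):
        if len(history) == window:
            mu, sigma = mean(history), stdev(history)
            # Floor sigma so a perfectly flat baseline still has a band.
            if abs(v - mu) > k * max(sigma, 1e-9):
                yield i, v
        history.append(v)


# Made-up series: a busy service around 50 and a quiet one around 10.
busy = [50, 52, 49, 51, 50, 51, 95, 50, 52]
quiet = [10, 11, 9, 10, 10, 40]

FIXED_THRESHOLD = 90  # hypothetical static limit
fixed_hits_quiet = [v for v in quiet if v > FIXED_THRESHOLD]

print(list(adaptive_alerts(busy)))   # spike relative to the busy baseline
print(list(adaptive_alerts(quiet)))  # spike relative to the quiet baseline
print(fixed_hits_quiet)              # the fixed limit never fires here
```

The quiet service's jump to 40 is far outside its own normal range, yet it stays under the static limit of 90. That is the "too few alerts" trap the speaker describes. Lowering the static limit enough to catch it would flood the busy service with alerts.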
This time it's a claims app and users are reporting really bad response times. You start your investigation by opening the events dashboard.

[5:30] Cloud Pak for Watson AIOps recognizes many data sources for event correlation. For example, data from LogDNA, ServiceNow, PagerDuty, and hundreds of other integrations. The dashboard groups related events based on inferred associations like topology, the time of occurrence, and location. Let's take a closer look at the events leading up to the claims app slowing down.

[5:53] This metrics timeline can help you with problem determination and assess potential impacts. The green indicates normal behavior over time. You can visualize baseline performance compared to the recently-captured data.

[6:07] The secondary view shows metrics based on application observability. These are metrics discovered and identified as being related to the app's response time. Here we see the disk busy metric is unstable. Let's add it to the timeline for further investigation.

[6:23] Now the primary view shows that just before the response time problems, the disk busy for the storage service went up to nearly 100% utilization and it stayed there. That's never good. Based on this brief analysis, you know the data store for the database was overloaded. It's a prime candidate for the root cause of the app's slowdown.

[6:42] The next step is confirming your analysis by checking the service logs and then proposing a proper solution. The analysis of the relationships between these metrics helps you understand the full scope of the problem. Once the fix is delivered, you can say with confidence, you've identified and resolved the true cause.

[6:58] Okay, let's recap. When it comes to IT Ops, it's better to be proactive than constantly being forced into a reactive posture.
With dynamically-determined rules, the data analysis by Cloud Pak for Watson AIOps helps you get to resolution quicker, potentially before your users even notice a problem. And you don't have to manage rules, consider how they interact with each other, or worry about how rules should change when the environment changes.

[7:23] What can automation mean to your company? How about 25% more time spent on work that drives your business or reducing manual labor costs by 50%?

[7:32] Thanks for watching. If you'd like to see more videos like this in the future, please click like and subscribe. If you want to learn more about Cloud Pak for Watson AIOps, make sure to check out the links in the description.
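
Editorial note: the not-so-lucky-day walkthrough relies on spotting which metric moves together with the app's response time. The video does not say how related metrics are discovered. As a rough sketch, assuming a plain Pearson correlation is an acceptable stand-in, candidate metrics could be ranked like this. The metric names and sample values are made up for illustration and are not taken from the demo.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    # A constant series has no variance; treat it as uncorrelated.
    return cov / (sx * sy) if sx and sy else 0.0


def rank_related_metrics(target, candidates):
    """Sort candidate metrics by |correlation| with the target series."""
    scores = {name: pearson(series, target)
              for name, series in candidates.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Hypothetical one-minute samples around the slowdown.
response_ms = [120, 125, 118, 122, 480, 900, 950, 940]
candidates = {
    "disk_busy_pct": [20, 22, 19, 21, 85, 98, 99, 99],
    "cpu_pct": [40, 42, 41, 39, 43, 40, 42, 41],
}

ranking = rank_related_metrics(response_ms, candidates)
for name, score in ranking:
    print(f"{name}: {score:+.2f}")
```

A high score only marks a metric as a candidate. As the speaker notes, the next step is still to confirm the analysis against the service logs before settling on a root cause.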