AIOps: Preventing Downtime Costs

5m • ai-ml • tutorial • intermediate

Key Points

Unplanned IT downtime can cost businesses millions, damage their brand, and even trigger regulatory penalties.
AIOps (Artificial Intelligence for Operations) leverages AI, machine learning, and advanced analytics on operational data to give IT teams faster, data‑driven decision‑making power.
In a typical outage—like an invoicing app failing for a real‑estate firm—AIOps helps pinpoint the problem and accelerate restoration.
The platform ingests key sources (events, metrics, logs, alerts), creates baselines and thresholds, and defines what “normal” looks like for the application.
It then contextualizes and surfaces actionable insights through collaborative tools, enabling operators and SREs to engage, troubleshoot, and restore service efficiently.

**Source:** [https://www.youtube.com/watch?v=XbYKAJc5jhg](https://www.youtube.com/watch?v=XbYKAJc5jhg)
**Duration:** 00:05:31

## Sections

- [00:00:00](https://www.youtube.com/watch?v=XbYKAJc5jhg&t=0s) **AIOps: Preventing Costly Downtime** - The speaker highlights the multi‑million‑dollar, brand‑damage, and regulatory risks of unexpected IT outages before defining AIOps as the application of AI, machine learning, and advanced analytics to operational data to empower IT professionals to detect, diagnose, and remediate incidents quickly.
- [00:03:06](https://www.youtube.com/watch?v=XbYKAJc5jhg&t=186s) **AIOps-Driven ChatOps Incident Resolution** - The passage outlines how AIOps ingests logs, provides contextual insights via chat‑ops to SREs, and enables automated script execution to quickly resolve incidents.

## Full Transcript

0:00 So what can a few minutes of unplanned downtime cost a business? Six or seven figures in lost revenue? A damaged brand? Regulatory action? Hi, my name is Albert Traylor and I'm with IBM Cloud, and today I'm here to talk about artificial intelligence for operations, also known as AIOps.

0:23 So before we go into the definition of AIOps, let's think about a scenario. Imagine that you are an IT operations professional at a successful real estate company. Let's call that company "Housing For All." At H4A, you support a portfolio of applications, one of which is an invoicing application used by tens of thousands of partners every day. For this specific application, your focus is to make sure it's up and that partners can consistently deliver invoices on it.

0:59 One day you settle in at your desk, you get a cup of coffee, and you get a phone call, just like that. Out of nowhere, a sales rep is calling to complain that a partner is trying to upload an invoice and has been unable to all morning because the application is down. What do you do in that scenario to get this application back up and running?

1:19 So before we continue down that scenario, let's talk about AIOps and the textbook definition. Artificial intelligence for operations is about the application of artificial intelligence, machine learning models, and advanced analytics to IT operational data. The objective is to empower IT professionals and operations professionals with the data they need to make decisions and ultimately resolve and restore service to an application faster.

1:50 So with that definition in mind, let's talk about how we can get this invoicing application back up and running. Now let's think about the key data sources for our invoicing application. Let's start with events, metrics, logs, alerts, and a few others, but for now let's focus on these data sources for our specific invoicing application.
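
As a rough illustration of the four data sources named above, the sketch below normalizes events, metrics, logs, and alerts into one common record shape before any analysis. The `OpsRecord` schema, the `normalize` helper, and all sample values are assumptions made for this example; the talk does not describe a specific schema or product API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class SourceKind(Enum):
    """The four data sources named in the talk."""
    EVENT = "event"
    METRIC = "metric"
    LOG = "log"
    ALERT = "alert"


@dataclass(frozen=True)
class OpsRecord:
    """One normalized piece of operational data (hypothetical schema)."""
    kind: SourceKind
    app: str
    timestamp: datetime
    payload: dict = field(default_factory=dict)


def normalize(kind: str, app: str, epoch_seconds: float, payload: dict) -> OpsRecord:
    """Convert a raw reading into the common record shape.

    Raises ValueError if `kind` is not one of the four known sources.
    """
    return OpsRecord(
        kind=SourceKind(kind),
        app=app,
        timestamp=datetime.fromtimestamp(epoch_seconds, tz=timezone.utc),
        payload=dict(payload),
    )


# Made-up sample data for the invoicing application.
records = [
    normalize("metric", "invoicing", 1700000000, {"name": "error_rate", "value": 0.002}),
    normalize("log", "invoicing", 1700000005, {"level": "ERROR", "message": "upload failed"}),
    normalize("alert", "invoicing", 1700000010, {"severity": "critical"}),
]
kinds = sorted({r.kind.value for r in records})
print(kinds)
```

Putting every source into one shape is a design choice for the sketch: it lets later steps (baselining, correlation) treat all inputs uniformly, whatever system produced them.
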
In the real world, this will look different depending on your application architecture, the types of data sources that are applicable for your application, and data regulatory requirements. So we've got our key data sources; now how does this fit into our model for AIOps? Let's think about three key steps. The first one we're going to call monitor and discover.

2:52 So in this first step, the data is ingested by the AIOps platform, which applies thresholds and creates baselines for your specific application. So let's think about it this way: what is normal for my invoicing application? What is the log ingestion rate? How many errors are acceptable based on our SLO, or service level objective? So we now have that information, and the next step of the process is about engage and context.

3:28 So the next step is engage and context. This is where AIOps really shines: it takes all of that ingested data and surfaces it to an IT operations professional or site reliability engineer in the form of a collaboration solution, also known as ChatOps. So up until now everything's been done in the background, and as soon as this incident pops up with our invoicing application, it's surfaced via ChatOps to our site reliability engineer. Less is more here: they have the context on where the incident is located in the application, what specific actions are recommended to resolve it, and most importantly, how those actions are based on incidents like this that have come up in the past.

4:11 So now our SRE is armed with that information, and it's on to our last phase, which is act and automate. So everything's been done in the background, this information has been surfaced to our ITOps professional or SRE, and then finally we have to act and automate to resolve this issue.
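
As a rough illustration of the monitor and discover step described earlier, the sketch below builds a baseline from past error rates and flags a new reading that either breaks the SLO or sits well above normal. The mean-plus-three-standard-deviations rule, the 5% SLO, and the sample numbers are all assumptions for this example; the talk does not say how a real AIOps platform computes its baselines.

```python
from statistics import mean, stdev


def build_baseline(history: list[float]) -> tuple[float, float]:
    """Return (mean, sample standard deviation) of past error rates."""
    return mean(history), stdev(history)


def is_anomalous(value: float, baseline: tuple[float, float],
                 slo_max_error_rate: float, sigmas: float = 3.0) -> bool:
    """Flag a reading that breaks the SLO or sits far above the baseline."""
    avg, spread = baseline
    above_baseline = value > avg + sigmas * spread
    breaks_slo = value > slo_max_error_rate
    return above_baseline or breaks_slo


# Made-up hourly error rates (fraction of failed invoice uploads).
history = [0.010, 0.012, 0.009, 0.011, 0.010, 0.013, 0.008]
baseline = build_baseline(history)
SLO_MAX_ERROR_RATE = 0.05  # assumed SLO: at most 5% failed uploads

print(is_anomalous(0.011, baseline, SLO_MAX_ERROR_RATE))  # a typical reading
print(is_anomalous(0.200, baseline, SLO_MAX_ERROR_RATE))  # an outage-level reading
```

The two checks answer different questions: the baseline check asks "is this unusual for this application?", while the SLO check asks "is this unacceptable regardless of history?". A reading can trip the first without tripping the second.
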
4:38 The suggested options that are available via ChatOps enable the ITOps professional to select what has worked in the past and, with a click, activate a script or runbook to resolve this issue as soon as it's detected. This gets our invoicing application back up and running faster and makes sure that our partners are happier with this experience.

4:58 So now we've got the overview: three key steps and the type of data that's ingested by AIOps. In summary, this system allows IT professionals to solve problems faster, to keep applications up and running, and to help protect the business in the long run.

5:14 Thank you. If you have any questions, please drop us a line below. If you want to see more videos like this in the future, please like and subscribe, and don't forget you can grow your skills and earn a badge with IBM CloudLabs, which are free browser-based interactive Kubernetes labs.
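
The engage and context and act and automate steps from the talk can be sketched, under heavy simplification, as ranking runbooks by how often they resolved similar past incidents and then running the one the operator picks. The incident history, symptom labels, runbook names, and the `recommend` and `run_selected` helpers are all hypothetical; a real platform would draw on its own incident records and automation tooling.

```python
from collections import Counter
from typing import Callable

# Hypothetical incident history: (symptom, runbook that resolved it).
PAST_INCIDENTS = [
    ("upload_timeout", "restart_upload_service"),
    ("upload_timeout", "restart_upload_service"),
    ("upload_timeout", "clear_upload_queue"),
    ("db_connection_errors", "failover_database"),
]


def recommend(symptom: str, history: list[tuple[str, str]]) -> list[tuple[str, int]]:
    """Rank runbooks by how often they resolved this symptom before."""
    counts = Counter(runbook for past, runbook in history if past == symptom)
    return counts.most_common()


def restart_upload_service() -> str:
    # Placeholder: a real runbook would call deployment tooling here.
    return "upload service restarted"


def clear_upload_queue() -> str:
    # Placeholder: a real runbook would drain or purge a queue here.
    return "upload queue cleared"


# Registry mapping runbook names to the actions they trigger (assumed names).
RUNBOOKS: dict[str, Callable[[], str]] = {
    "restart_upload_service": restart_upload_service,
    "clear_upload_queue": clear_upload_queue,
}


def run_selected(name: str) -> str:
    """Run the runbook the operator picked (the 'click' in the chat)."""
    if name not in RUNBOOKS:
        raise KeyError(f"no runbook registered for {name!r}")
    return RUNBOOKS[name]()


options = recommend("upload_timeout", PAST_INCIDENTS)
print(options)                      # ranked suggestions shown to the SRE
print(run_selected(options[0][0]))  # operator picks the top suggestion
```

Keeping the operator's selection as an explicit step mirrors the talk: the system suggests what has worked before, but a person chooses which action to run.
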