# AIOps: Preventing Downtime Costs

**Source:** [https://www.youtube.com/watch?v=XbYKAJc5jhg](https://www.youtube.com/watch?v=XbYKAJc5jhg)
**Duration:** 00:05:31

## Summary

- Unplanned IT downtime can cost businesses millions, damage their brand, and even trigger regulatory penalties.
- AIOps (Artificial Intelligence for Operations) leverages AI, machine learning, and advanced analytics on operational data to give IT teams faster, data‑driven decision‑making power.
- In a typical outage, like an invoicing app failing for a real‑estate firm, AIOps helps pinpoint the problem and accelerate restoration.
- The platform ingests key sources (events, metrics, logs, alerts), creates baselines and thresholds, and defines what “normal” looks like for the application.
- It then contextualizes and surfaces actionable insights through collaborative tools, enabling operators and SREs to engage, troubleshoot, and restore service efficiently.

## Sections

- [00:00:00](https://www.youtube.com/watch?v=XbYKAJc5jhg&t=0s) **AIOps: Preventing Costly Downtime** - The speaker highlights the multi‑million‑dollar, brand‑damage, and regulatory risks of unexpected IT outages before defining AIOps as the application of AI, machine learning, and advanced analytics to operational data to empower IT professionals to detect, diagnose, and remediate incidents quickly.
- [00:03:06](https://www.youtube.com/watch?v=XbYKAJc5jhg&t=186s) **AIOps-Driven ChatOps Incident Resolution** - The passage outlines how AIOps ingests logs, provides contextual insights via chat‑ops to SREs, and enables automated script execution to quickly resolve incidents.

## Full Transcript

So what can a few minutes of unplanned downtime cost a business? Six or seven figures in lost
revenue? A damaged brand? Regulatory action? Hi, my name is Albert Traylor and I'm with IBM
Cloud, and today I'm here to talk about artificial intelligence for operations, also known as AIOps.
So before we go into the definition of AIOps, let's think about a scenario.
Imagine that you are an IT operations professional
at a successful real estate company. Let's call that company "Housing For All."
At H4A you support a portfolio of applications, one of which is an invoicing application used by
tens of thousands of partners every day. For this specific application your focus is to make sure
it's up and that partners can deliver invoices consistently on it on a regular basis.
One day you settle in at your desk, you get a cup of coffee, and you get a phone call,
just like that. Out of nowhere a sales rep is calling to complain that a partner is trying
to upload an invoice and has been unable to all morning because the application is down.
What do you do in that scenario to get this application back up and running?
So before we continue down that scenario let's talk about AIOps and the textbook definition.
Artificial intelligence for operations is about the application of artificial intelligence,
machine learning models, and advanced analytics to IT operational data. The objective
is to empower IT professionals and operations professionals with the data they need to make
decisions and ultimately resolve and restore service to an application
faster. So with that definition in mind, let's talk about how we can get this invoice application
back up and running. Now let's think about the key data sources for our invoicing application.
Let's start with events, metrics,
logs, and alerts. There are a few others, but for now let's focus on these data sources
for our specific invoicing application. In the real world this will look different depending on your
application architecture, the type of data sources that are applicable for your application, and
data regulatory requirements. So we've got our key data sources, now how does this fit into our model
for AIOps? Let's think about three key steps. The first one we're going to call monitor and discover.
So in this first step the data is ingested by the AIOps platform,
which sets thresholds and creates baselines for your specific application.
So let's think about it this way: what is normal for my invoicing application?
What is the log ingestion rate? How many errors are acceptable based on our SLO, or service level
objective? So we now have that information. The next step of the process is about engage and context.
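
The monitor-and-discover step described above can be illustrated with a small Python sketch. It is not how any particular AIOps product works: the function names, the three-standard-deviation rule, the 99.9% SLO target, and the sample ingestion rates are all assumptions made up for this example. It only shows the general idea of learning what "normal" looks like from history and checking new readings against it.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Summarize historical samples as a mean and a standard deviation."""
    return {"mean": mean(samples), "stdev": stdev(samples)}


def is_anomalous(value, baseline, k=3.0):
    """Flag a value more than k standard deviations from the baseline mean."""
    return abs(value - baseline["mean"]) > k * baseline["stdev"]


def violates_slo(errors, requests, slo_success_rate=0.999):
    """Return True when the observed error rate exceeds what the SLO allows."""
    if requests == 0:
        return False  # no traffic, nothing to judge
    allowed_error_rate = 1.0 - slo_success_rate
    return errors / requests > allowed_error_rate


# Hypothetical log ingestion rates (lines per minute) for the invoicing app.
history = [1000, 1040, 980, 1010, 995, 1025, 990]
baseline = build_baseline(history)

print(is_anomalous(1015, baseline))   # a reading close to the usual rate
print(is_anomalous(200, baseline))    # a sharp drop in ingested logs
print(violates_slo(errors=5, requests=1000))
```

A fixed multiple of the standard deviation is only one simple way to set a threshold; a real platform would likely account for seasonality and traffic patterns.
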
So the next step is engage and context. So this is where AIOps really shines:
it takes all of that ingested data, and it surfaces it to an IT operations professional
or site reliability engineer through a collaboration solution, also known as ChatOps.
So up until now everything's been done in the background, and as soon as this incident pops
up with our invoicing application, it's surfaced via ChatOps to our site reliability engineer.
Less is more here: they have the context on where the incident is located in the application,
what specific actions are recommended to resolve it, and most importantly,
how those actions are based on incidents like this that have come up in the past.
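
As a rough illustration of the engage-and-context step, the sketch below ranks candidate actions by how often they resolved past incidents on the same component, then formats a short message of the kind that might be posted to a chat channel. Everything here is hypothetical: the `PastIncident` record, the component and action names, and the message layout are assumptions for the example, not the format of any real ChatOps tool.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class PastIncident:
    component: str   # where the incident was located
    action: str      # what the operator tried
    resolved: bool   # whether that action fixed it


def recommend_actions(component, history, top_n=2):
    """Rank actions by how often they resolved past incidents on this component."""
    wins = Counter(
        p.action for p in history if p.component == component and p.resolved
    )
    return [action for action, _ in wins.most_common(top_n)]


def format_chatops_message(app, component, summary, actions):
    """Build a short, plain-text incident summary for a chat channel."""
    lines = [
        f"Incident in {app}: {summary}",
        f"Location: {component}",
        "Recommended actions (based on past incidents):",
    ]
    lines += [f"  {i}. {a}" for i, a in enumerate(actions, start=1)]
    return "\n".join(lines)


# Hypothetical incident history for the invoicing application.
past = [
    PastIncident("upload-service", "restart_upload_workers", True),
    PastIncident("upload-service", "restart_upload_workers", True),
    PastIncident("upload-service", "clear_upload_queue", True),
    PastIncident("upload-service", "restart_upload_workers", False),
    PastIncident("database", "fail_over_replica", True),
]

actions = recommend_actions("upload-service", past)
message = format_chatops_message(
    "invoicing", "upload-service", "invoice uploads failing", actions
)
print(message)
```

Keeping the message to location, recommendations, and their basis in history mirrors the "less is more" point: the engineer gets only what is needed to decide.
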
So now our SRE is armed with that information, and it's on to our last phase, which is act and automate.
So everything's been done in the background, this information has been surfaced to our
ITOps professional or SRE, and then finally we have to act and automate to resolve this issue.
The suggested options that are available via ChatOps enable the ITOps professional to select what has
worked in the past and, with a click, activate a script or runbook to resolve this issue
as soon as it's detected. This gets our invoicing application back up and running faster
and makes sure that our partners are happier with this experience.
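
The act-and-automate step can be pictured as a lookup from the option the operator clicks to a registered runbook. The sketch below is a minimal illustration under that assumption; the `RUNBOOKS` registry, the `runbook` decorator, and the runbook name are invented for this example, and the runbook body only returns a string where a real one would call a script or orchestration tool.

```python
from typing import Callable, Dict

# Registry mapping a runbook name to the function that carries it out.
RUNBOOKS: Dict[str, Callable[[], str]] = {}


def runbook(name):
    """Decorator that registers a function under a runbook name."""
    def register(func):
        RUNBOOKS[name] = func
        return func
    return register


@runbook("restart_upload_workers")
def restart_upload_workers():
    # Placeholder: a real runbook would invoke a script or an API here.
    return "upload workers restarted"


def run_selected(name):
    """Run the runbook the operator picked; refuse anything unregistered."""
    if name not in RUNBOOKS:
        raise KeyError(f"no runbook registered as {name!r}")
    return RUNBOOKS[name]()


print(run_selected("restart_upload_workers"))
```

Restricting execution to registered names means a chat click can only trigger actions that someone has reviewed and added to the registry.
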
So now we've got the overview, three key steps, the type of data that's ingested by AIOps.
In summary, this system allows IT professionals to solve problems faster,
to keep applications up and running, and help protect the business in the long run.
Thank you. If you have any questions please drop us a line below. If you want to see more videos like
this in the future, please like and subscribe, and don't forget you can grow your skills and earn a
badge with IBM CloudLabs, which are free browser-based interactive Kubernetes labs.