Root Cause Analysis: 7 Essential Steps
Key Points
- An RCA (Root Cause Analysis) is a standardized seven‑step process used after any customer‑impacting incident—such as outages, network loss, or power failures—to identify the underlying cause and prevent recurrence.
- The first critical step is to clearly define the actual problem, distinguishing it from surface‑level symptoms like “the database went offline.”
- Collecting reliable data is essential; decisions must be evidence‑based rather than based on guesses or assumptions, even for glitches that appear to self‑resolve.
- The “why” stage requires deep causal questioning to trace each effect back through maintenance practices, schedules, equipment performance, and manufacturer specifications.
- Ultimately, the RCA framework guides teams to not only fix the immediate issue but also implement systematic changes that ensure the problem does not happen again.
Sections
- Ensuring Effective Root Cause Analysis - The speaker stresses that a true RCA must go beyond identifying a problem—by asking “why,” mapping causal chains, and implementing corrective actions that prevent recurrence rather than merely completing paperwork.
- Critical Role of Post-Incident Communication - The speaker emphasizes that transparent, thorough communication of root‑cause analysis findings, corrective actions, and preventive steps is the hardest but essential element for maintaining stakeholder trust after a customer‑impacting incident.
Full Transcript
**Source:** [https://www.youtube.com/watch?v=7t3lTYEd_PM](https://www.youtube.com/watch?v=7t3lTYEd_PM)
**Duration:** 00:08:31
**Section timestamps:** [00:03:08](https://www.youtube.com/watch?v=7t3lTYEd_PM&t=188s) Ensuring Effective Root Cause Analysis, [00:06:13](https://www.youtube.com/watch?v=7t3lTYEd_PM&t=373s) Critical Role of Post-Incident Communication
Hi there, and thanks so much for
clicking on this video.
My name is Bradley Knapp and I'm
with IBM Cloud.
And the topic we're going to discuss
today is what is an RCA,
or a "Root Cause Analysis".
All right. So to start off
with, an RCA is a standard
process that you should go through
any time within the technology
industry that you have what I like
to call a customer-impacting event,
or a serious event where
something has gone wrong and it has
resulted in serious
problems for your customers.
It could be a downtime outage.
It could be a loss of network
connectivity.
It could be a loss
of electricity.
But no matter what the problem
is, this RCA process,
which is seven steps,
is designed to help you not only
identify what the problem is,
but how to fix it so that it doesn't
ever happen again.
So with that in mind,
let's jump right in.
Right. So the first step
in an RCA, and I know this seems
basic but it is the first and most
important step, is you must
identify
what went wrong, right?
You have to identify your problem.
And that means you have to define
your problem.
It's not just a matter of figuring
out the symptoms.
The symptom is my "computing
environment stopped being available"
or "the database dropped
offline". That's a symptom.
That's not identifying what the
problem is.
The first step in that RCA
is figure out what
it is that went wrong.
And so, in order to identify
what went wrong, your second
step, which is really related to the
first, is you must collect
data.
Because the decisions that you're
going to make as part of this RCA
process have to
be based in data.
They can't be based in guesses. They
can't be based in
conjecture.
You have to know what
went wrong and you have to have the
data to back that up.
Sometimes you get a little glitch
that resolves itself automatically.
You're still going to want to run an
RCA process on that and
you want to know what caused the
glitch. You can't just trust that
it's going to magically fix itself
in the future.
So you must collect data.
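
One way to make the data collection step concrete is to merge records from different systems into a single ordered timeline. The sketch below is only an illustration: the event records, source names, and the `build_timeline` helper are hypothetical and do not come from the talk.

```python
from datetime import datetime

# Hypothetical log records gathered from different systems during an incident.
raw_events = [
    ("2024-01-01T10:05:00", "database", "connection pool exhausted"),
    ("2024-01-01T10:01:00", "power", "utility feed lost"),
    ("2024-01-01T10:03:00", "power", "breaker tripped, no automatic reset"),
]


def build_timeline(events):
    """Parse the timestamps and return the events ordered from earliest to latest."""
    parsed = [
        (datetime.fromisoformat(ts), source, message)
        for ts, source, message in events
    ]
    return sorted(parsed, key=lambda event: event[0])


timeline = build_timeline(raw_events)
for when, source, message in timeline:
    print(f"{when.isoformat()} [{source}] {message}")
```

Ordering the evidence this way lets the later "why" questions start from what actually happened first, rather than from whichever symptom was noticed first.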
Now, next step is
you have identified your problem,
right? You've defined it.
You've got your data.
You now have to ask why,
and asking why is more than
just asking the question.
You have to make causal
connections.
So as we're asking
why,
we have to get really in-depth,
right? It can't just be
"all right, well, the power
went out and as a result, these
breakers tripped, and then when
the power came back on, the breakers
didn't automatically reset".
You have to know, why didn't they
automatically reset?
Is it because you weren't doing
your preventative maintenance
correctly? Or were you doing your
preventative maintenance correctly,
but the schedule wasn't right?
Or, even though you were doing the
preventative maintenance and the
schedule was right, maybe
the equipment just failed.
Did it fail in service?
Did it fail out of service?
Do you need to go back to the
manufacturer of that equipment and
find out what on earth is going on?
Because things did not happen
the way that they should have.
You have to ask why and you have to
make those causal connections.
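
The repeated "why" questioning can be sketched as walking a chain of cause-and-effect links. This is a minimal illustration, assuming a hypothetical chain loosely based on the breaker example; the specific causes and the `trace_root_cause` helper are made up for the sketch, and a real incident would usually branch rather than form a single line.

```python
# Hypothetical causal chain; each entry maps an observed effect to the
# cause found by asking "why?" one more time.
causes = {
    "database dropped offline": "servers lost power",
    "servers lost power": "breakers did not reset automatically",
    "breakers did not reset automatically": "preventative maintenance schedule was too infrequent",
}


def trace_root_cause(symptom, causes):
    """Follow the 'why' answers from a symptom until no deeper cause is recorded."""
    chain = [symptom]
    seen = {symptom}
    while chain[-1] in causes:
        next_cause = causes[chain[-1]]
        if next_cause in seen:  # guard against a circular chain
            break
        chain.append(next_cause)
        seen.add(next_cause)
    return chain


chain = trace_root_cause("database dropped offline", causes)
for depth, step in enumerate(chain):
    print("  " * depth + step)
```

The last entry in the chain is only the deepest cause recorded so far. If it still invites another "why", the analysis is not finished.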
And remember, one of the biggest
goals of the RCA is to
not only figure out what you
did in order to solve the problem,
you have to figure out how to
keep it from happening again.
If you don't get to that point,
there's no point in doing an RCA,
it's just a paperwork exercise.
Why even bother with it?
So we've identified what went
wrong. We've got our data.
We've asked why, we've made those
causal connections between
everything that went wrong to figure
out what happened.
Because remember, in the world
of technology, our problems
are very, very rarely a
single thing.
It is almost always a cascading
error of some kind that
started with something simple
and cascaded into something serious.
So we've made those causal
connections.
Now we have to actually figure out
what are we going to fix.
Right. We are going to identify
those corrections.
So we've identified
what it is that we're going to fix,
and solving it
isn't just a matter of figuring out
what it is you're going to fix.
You've also got to figure out
how to keep it from happening again.
Right. And so once you've identified
your corrections, a huge, huge
piece of that is that you need to
figure out what defects did you find
in your data collection stage.
So do you need to improve your
monitoring?
Are you collecting all of the
things that you need?
Are you also logging it?
Did you figure out that you're
actually monitoring the things
correctly, but you're not storing
that data because monitoring
and logging go hand in hand?
There is no point in having active
monitoring that you don't also save
so that you can do this kind of
analysis later.
Likewise, there's no point in
logging data that no one is ever
looking at.
So they are hand in hand.
And that's a big piece of
identifying the corrections,
identifying what you are going to
fix.
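
The monitoring and logging gaps described here can be expressed as simple set differences. The signal names and the `find_gaps` helper below are hypothetical, chosen only to illustrate the three kinds of gap.

```python
# Hypothetical inventories; the signal names are illustrative only.
needed = {"power_feed_status", "breaker_state", "db_connections", "network_latency"}
monitored = {"breaker_state", "db_connections", "network_latency"}
logged = {"db_connections", "network_latency", "disk_usage"}
reviewed = {"db_connections"}


def find_gaps(needed, monitored, logged, reviewed):
    """Return each kind of gap as a sorted list of signal names."""
    return {
        "not_monitored": sorted(needed - monitored),
        "monitored_but_not_logged": sorted(monitored - logged),
        "logged_but_never_reviewed": sorted(logged - reviewed),
    }


gaps = find_gaps(needed, monitored, logged, reviewed)
for kind, signals in gaps.items():
    print(f"{kind}: {', '.join(signals) or 'none'}")
```

Each non-empty list is a candidate correction: something to start monitoring, start storing, or start looking at.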
All right. So we've found our gaps
in monitoring. We found our gaps in
logging.
We figured out what other
corrections it is that we need to
make.
What do we do now?
We've got to implement the solution,
right. Now is the implementation
phase.
So,
implementation. We have
to implement not just the short-term
fix that we used to solve the outage
that we had, but we also have to
implement all of the long-term
things, right, monitoring, logging,
other corrections, software defects.
Maybe we have a change management
problem.
You have to implement all
of those fixes and you have
to get them out there, because
there's no point in doing the work
to just let it sit, right?
If you write up the RCA and you
don't implement all of the changes
that you need in order to
make sure it doesn't happen again,
again, it's a paperwork exercise.
Not worth the time.
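
The point that an RCA is not done until every correction is implemented can be sketched as a small tracking structure. The `Correction` class, its fields, and the sample items are assumptions made for this illustration, not something prescribed in the talk.

```python
from dataclasses import dataclass


@dataclass
class Correction:
    description: str
    long_term: bool    # False for the short-term fix that ended the outage
    implemented: bool


# Hypothetical corrections for the breaker example.
corrections = [
    Correction("Manually reset tripped breakers", long_term=False, implemented=True),
    Correction("Add alerting on breaker state", long_term=True, implemented=True),
    Correction("Revise preventative maintenance schedule", long_term=True, implemented=False),
]


def outstanding(corrections):
    """Return the corrections that have not been implemented yet."""
    return [c for c in corrections if not c.implemented]


def rca_is_complete(corrections):
    """Treat the RCA as complete only when nothing is outstanding."""
    return not outstanding(corrections)


for item in outstanding(corrections):
    print(f"Still open: {item.description}")
print("RCA complete:", rca_is_complete(corrections))
```

Here the short-term fix is already in place, but the RCA stays open because a long-term correction is still pending.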
And then the last step, and this is
the one that is often the hardest
for everyone involved in any
kind of a customer-impacting
event, is communication.
So I'm just going to put this up
here as "comms".
And I'm going to underline it,
actually, I'm going to underline it
twice.
Communication is so important
because once you have figured out
what the problem is, you figured out
how you're going to fix it, and you
figured out what gaps and defects
you have, you've implemented fixes
for those gaps and defects, you have to
keep your stakeholders apprised
of what is going on.
And it's hard for us to admit that
things went wrong. It's hard for us
to admit that there were failures
that were our fault, that we have
acknowledged and we're going to fix.
And so if we think about a company
culture, comms
around RCAs are so
important because it is
acknowledging to your customers,
"yes, we're not perfect".
We made a mistake, or
the vendors that we selected had a
problem, or really
just about anything that
could go wrong will eventually go
wrong. But we are reassuring
you through our communication in
this RCA process that
we know that things happen and we
are going to fix it and ensure that
it never happens again.
This comms piece is the most
important part.
You can't just write up a two or
three sentence, "Yes, something
broke. We have identified fixes.
We've implemented them and we'll
make sure it doesn't happen again".
You need to go one level further
than that. You have to restore
that trust and restore your
customers' confidence in you
that you are acknowledging
that you are not perfect and you're
going to fix things so that they
don't happen again.
And then you've got to keep
that communication going.
Once you've reached this
implementation phase and you've
actually got the fix rolled out,
at that point
keep up with your customers.
If somebody had a CIE (customer-impacting event) six months
ago, reach out to them.
Be sure that they're still OK.
Be sure that they have accepted
the plan that you gave them on how
it's never going to happen again and
be sure that they are OK with it and
that it's solving their needs.
So thank you so much.
Hopefully this was helpful to you.
If you have any questions or
comments, please feel free to share
them with us below.
If you enjoyed this video and you
would like to see more like it in
the future, please do like the video
and subscribe to us so that we'll
know to keep creating for you.