Root Cause Analysis: 7 Essential Steps

8m • devops • tutorial • intermediate

Key Points

An RCA (Root Cause Analysis) is a standardized seven‑step process used after any customer‑impacting incident—such as outages, network loss, or power failures—to identify the underlying cause and prevent recurrence.
The first critical step is to clearly define the actual problem, distinguishing it from surface‑level symptoms like “the database went offline.”
Collecting reliable data is essential; decisions must be evidence‑based rather than based on guesses or assumptions, even for glitches that appear to self‑resolve.
The “why” stage requires deep causal questioning to trace each effect back through maintenance practices, schedules, equipment performance, and manufacturer specifications.
Ultimately, the RCA framework guides teams to not only fix the immediate issue but also implement systematic changes that ensure the problem does not happen again.

**Source:** [https://www.youtube.com/watch?v=7t3lTYEd_PM](https://www.youtube.com/watch?v=7t3lTYEd_PM)

**Duration:** 00:08:31

## Sections

- [00:00:00](https://www.youtube.com/watch?v=7t3lTYEd_PM&t=0s) **Untitled Section**
- [00:03:08](https://www.youtube.com/watch?v=7t3lTYEd_PM&t=188s) **Ensuring Effective Root Cause Analysis** - The speaker stresses that a true RCA must go beyond identifying a problem: it means asking “why,” mapping causal chains, and implementing corrective actions that prevent recurrence rather than merely completing paperwork.
- [00:06:13](https://www.youtube.com/watch?v=7t3lTYEd_PM&t=373s) **Critical Role of Post-Incident Communication** - The speaker emphasizes that transparent, thorough communication of root‑cause analysis findings, corrective actions, and preventive steps is the hardest but essential element for maintaining stakeholder trust after a customer‑impacting incident.

## Full Transcript

[0:00] Hi there, and thanks so much for clicking on this video. My name is Bradley Knapp and I'm with IBM Cloud. And the topic we're going to discuss today is what is an RCA, or a "Root Cause Analysis".

[0:13] All right. So to start off with, an RCA is a standard process that you should go through any time within the technology industry that you have what I like to call a customer-impacting event, or a serious event where something has gone wrong and it has resulted in serious problems for your customers. It could be a downtime outage. It could be a loss of network connectivity. It could be a loss of electricity. But no matter what the problem is, this RCA process, which is seven steps, is designed to help you not only identify what the problem is, but how to fix it so that it doesn't ever happen again. So with that in mind, let's jump right in.

[1:00] Right. So the first step in an RCA, and I know this seems basic but it is the first and most important step, is you must identify what went wrong, right? You have to identify your problem. And that means you have to define your problem. It's not just a matter of figuring out the symptoms. The symptom is "my computing environment stopped being available" or "the database dropped offline". That's a symptom. That's not identifying what the problem is. The first step in that RCA is to figure out what it is that went wrong.

[1:35] And so, in order to identify what went wrong, your second step, which is really related to the first, is you must collect data. Because the decisions that you're going to make as part of this RCA process have to be based in data. They can't be based in guesses; they can't be based in conjecture.
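To make the difference between a symptom and a defined, data-backed problem concrete, here is a minimal sketch. The `ProblemStatement` class, its fields, and the sample strings are hypothetical, invented for this illustration; they are not something the speaker shows. The only idea taken from the talk is that a problem is not "defined" until it goes beyond the symptom and has data behind it.

```python
from dataclasses import dataclass, field


@dataclass
class ProblemStatement:
    """Hypothetical record for the first two RCA steps (not from the video)."""

    symptom: str                                        # what customers observed
    problem: str = ""                                   # what actually went wrong, once known
    evidence: list[str] = field(default_factory=list)   # data backing the problem

    def is_defined(self) -> bool:
        # Defined only if a problem is stated, it is more than a restatement
        # of the symptom, and at least one piece of data supports it.
        return (
            bool(self.problem)
            and self.problem != self.symptom
            and bool(self.evidence)
        )


# Illustrative values only.
ps = ProblemStatement(symptom="computing environment stopped being available")
print(ps.is_defined())  # False: only a symptom has been recorded

ps.problem = "breakers did not reset after power was restored"
ps.evidence.append("breaker status readings from the time of the outage")
print(ps.is_defined())  # True: stated problem plus supporting data
```

The check is deliberately strict: a guess typed into `problem` with an empty `evidence` list still counts as undefined, which mirrors the point that conjecture is not enough.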
[1:58] You have to know what went wrong and you have to have the data to back that up. Sometimes you get a little glitch that resolves itself automatically. You're still going to want to run an RCA process on that, and you want to know what caused the glitch. You can't just trust that it's going to magically fix itself in the future. So you must collect data.

[2:18] Now, next step: you have identified your problem, right? You've defined it. You've got your data. You now have to ask why, and asking why is more than just asking the question. You have to make causal connections.

[2:33] So as we're asking why, we have to get really in-depth, right? It can't just be "all right, well, the power went out and as a result, these breakers tripped, and then when the power came back on, the breakers didn't automatically reset". You have to know, why didn't they automatically reset? Is it because you weren't doing your preventative maintenance correctly? Or you were doing your preventative maintenance correctly, but the schedule wasn't right? Or, even though you were doing the preventative maintenance and the schedule was right, maybe the equipment just failed. Did it fail in service? Did it fail out of service? Do you need to go back to the manufacturer of that equipment and find out what on earth is going on? Because things did not happen the way that they should have. You have to ask why and you have to make those causal connections.

[3:24] And remember, one of the biggest goals of the RCA is to not only figure out what you did in order to solve the problem; you have to figure out how to keep it from happening again. If you don't get to that point, there's no point in doing an RCA. It's just a paperwork exercise. Why even bother with it?
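The causal questioning described above can be pictured as a chain of effect-to-cause links that is followed until no further "why" has a recorded answer. This is a minimal sketch built on the breaker example; the `trace_root_cause` helper and the `causes` mapping are hypothetical, and the final cause (a wrong maintenance schedule) is just one of the possibilities the speaker lists, picked here as an assumption.

```python
def trace_root_cause(effect: str, causes: dict[str, str]) -> list[str]:
    """Follow effect -> cause links until an effect has no recorded cause.

    Hypothetical helper for illustration. It returns the whole chain so that
    every causal connection stays visible, not just the final answer.
    """
    chain = [effect]
    seen = {effect}
    while chain[-1] in causes:
        cause = causes[chain[-1]]
        if cause in seen:  # guard against a circular explanation
            raise ValueError(f"circular causal chain at {cause!r}")
        chain.append(cause)
        seen.add(cause)
    return chain


# Each key is an effect; each value answers "why did that happen?".
# The entries are assumed for the example, not findings from a real incident.
causes = {
    "environment unavailable": "breakers did not reset",
    "breakers did not reset": "breaker reset mechanism failed",
    "breaker reset mechanism failed": "preventative maintenance schedule was wrong",
}

for depth, step in enumerate(trace_root_cause("environment unavailable", causes)):
    print("  " * depth + step)
```

A single cause per effect is a simplification. Mapping each effect to a list of contributing causes would be closer to a real incident, at the cost of turning the chain into a tree.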
[3:43] So we've identified what went wrong. We've got our data. We've asked why, we've made those causal connections between everything that went wrong to figure out what happened. Because remember, in the world of technology, our problems are very, very rarely a single thing. It is almost always a cascading error of some kind that started with something simple and cascaded into something serious.

[4:08] So we've made those causal connections. Now we have to actually figure out what we are going to fix. Right. We are going to identify those corrections.

[4:24] So we've identified what it is that we're going to fix, and solving it isn't just a matter of figuring out what it is you're going to fix. You've also got to figure out how to keep it from happening again. Right. And so once you've identified your corrections, a huge, huge piece of that is that you need to figure out what defects you found in your data collection stage.

[4:46] So do you need to improve your monitoring? Are you collecting all of the things that you need? Are you also logging it? Did you figure out that you're actually monitoring the things correctly, but you're not storing that data? Because monitoring and logging go hand in hand. There is no point in having active monitoring that you don't also save so that you can do this kind of analysis later. Likewise, there's no point in logging data that no one is ever looking at. So they are hand in hand. And that's a big piece of identifying the corrections, identifying what you are going to fix.

[5:21] All right. So we've found our gaps in monitoring. We've found our gaps in logging. We've figured out what other corrections it is that we need to make. Now, what do we do now? We've got to implement the solution, right?
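One way to look for the monitoring and logging gaps described above is to compare three sets: the signals the analysis needed, the signals that are monitored, and the signals that are actually stored. The `find_gaps` function and every signal name below are hypothetical, chosen only to illustrate the comparison.

```python
def find_gaps(
    needed: set[str], monitored: set[str], logged: set[str]
) -> dict[str, set[str]]:
    """Compare signal sets and report the gaps (illustrative helper)."""
    return {
        # Needed for the analysis but never observed at all.
        "not_monitored": needed - monitored,
        # Observed live but not saved, so unavailable for later analysis.
        "monitored_not_logged": monitored - logged,
        # Saved but not watched, so nobody would notice a problem in it.
        "logged_not_monitored": logged - monitored,
    }


# Assumed inventory for the example.
needed = {"breaker_state", "input_voltage", "db_connections"}
monitored = {"input_voltage", "db_connections"}
logged = {"db_connections", "fan_speed"}

for gap, signals in find_gaps(needed, monitored, logged).items():
    print(gap, sorted(signals))
```

Treating "logged but not monitored" as a stand-in for "data that no one is ever looking at" is an approximation; in practice that question may need access or query statistics rather than a simple set difference.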
[5:32] Now is the implementation phase. So, implementation. We have to implement not just the short-term fix that we used to solve the outage that we had, but we also have to implement all of the long-term things, right: monitoring, logging, other corrections, software defects. Maybe we have a change management problem. You have to implement all of those fixes and you have to get them out there, because there's no point in doing the work to just let it sit, right? If you write up the RCA and you don't implement all of the changes that you need in order to make sure it doesn't happen again, again, it's a paperwork exercise. Not worth the time.

[6:15] And then the last step, and this is the one that is often the hardest for everyone involved in any kind of a customer-impacting event, is communication. So I'm just going to put this up here as "comms". And I'm going to underline it; actually, I'm going to underline it twice.

[6:32] Communication is so important because once you have figured out what the problem is, you've figured out how you're going to fix it, you've figured out what gaps and defects you have, and you've implemented the fixes for those gaps and defects, you have to keep your stakeholders apprised of what is going on. And it's hard for us to admit that things went wrong. It's hard for us to admit that there were failures that were our fault, that we have acknowledged and we're going to fix.

[6:55] And so if we think about a company culture, comms around RCAs are so important because it is acknowledging to your customers, "yes, we're not perfect". We made a mistake, or the vendors that we selected had a problem, or really just about anything that could go wrong will eventually go wrong.
[7:15] But we are reassuring you through our communication in this RCA process that we know that things happen and we are going to fix it and ensure that it never happens again.

[7:26] This comms piece is the most important part. You can't just write up a two- or three-sentence "Yes, something broke. We have identified fixes. We've implemented them and we'll make sure it doesn't happen again". You need to go one level further than that. You have to restore that trust and restore your customers' confidence in you, that you are acknowledging that you are not perfect and you're going to fix things so that they don't happen again.

[7:52] And then you've got to keep that communication going. Once you've reached this implementation phase and you've actually got the fix rolled out, at that point keep up with your customers. If somebody had a CIE [customer-impacting event] six months ago, reach out to them. Be sure that they're still OK. Be sure that they have accepted the plan that you gave them on how it's never going to happen again, and be sure that they are OK with it and that it's solving their needs.

[8:17] So thank you so much. Hopefully this was helpful to you. If you have any questions or comments, please feel free to share them with us below. If you enjoyed this video and you would like to see more like it in the future, please do like the video and subscribe to us so that we'll know to keep creating for you.