Root Cause Analysis: 7 Essential Steps

Key Points

  • An RCA (Root Cause Analysis) is a standardized seven‑step process used after any customer‑impacting incident—such as outages, network loss, or power failures—to identify the underlying cause and prevent recurrence.
  • The first critical step is to clearly define the actual problem, distinguishing it from surface‑level symptoms like “the database went offline.”
  • Collecting reliable data is essential; decisions must be evidence‑based rather than based on guesses or assumptions, even for glitches that appear to self‑resolve.
  • The “why” stage requires deep causal questioning to trace each effect back through maintenance practices, schedules, equipment performance, and manufacturer specifications.
  • Ultimately, the RCA framework guides teams to not only fix the immediate issue but also implement systematic changes that ensure the problem does not happen again.

Full Transcript

# Root Cause Analysis: 7 Essential Steps

**Source:** [https://www.youtube.com/watch?v=7t3lTYEd_PM](https://www.youtube.com/watch?v=7t3lTYEd_PM)

**Duration:** 00:08:31

## Sections

- [00:00:00](https://www.youtube.com/watch?v=7t3lTYEd_PM&t=0s) **Untitled Section**
- [00:03:08](https://www.youtube.com/watch?v=7t3lTYEd_PM&t=188s) **Ensuring Effective Root Cause Analysis** - The speaker stresses that a true RCA must go beyond identifying a problem: it means asking “why,” mapping causal chains, and implementing corrective actions that prevent recurrence rather than merely completing paperwork.
- [00:06:13](https://www.youtube.com/watch?v=7t3lTYEd_PM&t=373s) **Critical Role of Post-Incident Communication** - The speaker emphasizes that transparent, thorough communication of root‑cause analysis findings, corrective actions, and preventive steps is the hardest but essential element for maintaining stakeholder trust after a customer‑impacting incident.

## Full Transcript
0:00 Hi there, and thanks so much for clicking on this video. My name is Bradley Knapp and I'm with IBM Cloud. And the topic we're going to discuss today is what is an RCA, or a "Root Cause Analysis".

0:13 All right. So to start off with, an RCA is a standard process that you should go through any time within the technology industry that you have what I like to call a customer-impacting event, or a serious event where something has gone wrong and it has resulted in serious problems for your customers. It could be a downtime outage. It could be a loss of network connectivity. It could be a loss of electricity. But no matter what the problem is, this RCA process, which is seven steps, is designed to help you not only identify what the problem is, but how to fix it so that it doesn't ever happen again. So with that in mind, let's jump right in.

1:00 Right. So the first step in an RCA, and I know this seems basic but it is the first and most important step, is you must identify what went wrong, right? You have to identify your problem. And that means you have to define your problem. It's not just a matter of figuring out the symptoms. The symptom is "my computing environment stopped being available" or "the database dropped offline". That's a symptom. That's not identifying what the problem is. The first step in that RCA is to figure out what it is that went wrong.

1:35 And so, in order to identify what went wrong, your second step, which is really related to the first, is you must collect data. Because the decisions that you're going to make as part of this RCA process have to be based in data. They can't be based in guesses; they can't be based in conjecture.
1:58 You have to know what went wrong and you have to have the data to back that up. Sometimes you get a little glitch that resolves itself automatically. You're still going to want to run an RCA process on that and you want to know what caused the glitch. You can't just trust that it's going to magically fix itself in the future. So you must collect data.

2:18 Now, next step is you have identified your problem, right? You've defined it. You've got your data. You now have to ask why, and asking why is more than just asking the question. You have to make causal connections.

2:33 So as we're asking why, we have to get really in-depth, right? It can't just be "all right, well, the power went out and as a result, these breakers tripped, and then when the power came back on, the breakers didn't automatically reset". You have to know, why didn't they automatically reset? Is it because you weren't doing your preventative maintenance correctly, or you were doing your preventative maintenance correctly but the schedule wasn't right, or even though you were doing the preventative maintenance and the schedule was right, maybe the equipment just failed? Did it fail in service? Did it fail out of service? Do you need to go back to the manufacturer of that equipment and find out what on earth is going on? Because things did not happen the way that they should have. You have to ask why and you have to make those causal connections.

3:24 And remember, one of the biggest goals of the RCA is to not only figure out what you did in order to solve the problem; you have to figure out how to keep it from happening again. If you don't get to that point, there's no point in doing an RCA, it's just a paperwork exercise. Why even bother with it?
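The "ask why" step above can be sketched as a small tree of causes in which every link should carry evidence. This is only an illustration, not a tool the speaker describes: the `Cause` class, the helper names, and the evidence strings are all hypothetical, and the scenario simply reuses the breaker example from the talk.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cause:
    """One link in a causal chain: what happened and the evidence for it."""
    description: str
    evidence: List[str] = field(default_factory=list)
    caused_by: List["Cause"] = field(default_factory=list)


def unsupported(cause: Cause) -> List[str]:
    """List every cause in the tree that has no supporting evidence yet."""
    missing = [] if cause.evidence else [cause.description]
    for parent in cause.caused_by:
        missing.extend(unsupported(parent))
    return missing


def deepest_causes(cause: Cause) -> List[str]:
    """List the causes with nothing beneath them (root-cause candidates)."""
    if not cause.caused_by:
        return [cause.description]
    found: List[str] = []
    for parent in cause.caused_by:
        found.extend(deepest_causes(parent))
    return found


# Hypothetical data modeled on the breaker example from the talk.
incident = Cause(
    "breakers did not reset after power returned",
    evidence=["facility event log (hypothetical)"],
    caused_by=[
        Cause("preventive maintenance was done incorrectly"),
        Cause("maintenance schedule was wrong"),
        Cause("equipment failed", evidence=["vendor test report (hypothetical)"]),
    ],
)

print(unsupported(incident))     # candidates that are still guesses
print(deepest_causes(incident))  # where each "why" currently stops
```

Anything returned by `unsupported` is still conjecture rather than a finding, which is the situation the speaker warns against when he says decisions have to be based in data.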
3:43 So we've identified what went wrong. We've got our data. We've asked why, we've made those causal connections between everything that went wrong to figure out what happened. Because remember, in the world of technology, our problems are very, very rarely a single thing. It is almost always a cascading error of some kind that started with something simple and cascaded into something serious.

4:08 So we've made those causal connections. Now we have to actually figure out what we are going to fix. Right. We are going to identify those corrections.

4:24 So we've identified what it is that we're going to fix, and solving it isn't just a matter of figuring out what it is you're going to fix. You've also got to figure out how to keep it from happening again. Right. And so once you've identified your corrections, a huge, huge piece of that is that you need to figure out what defects you found in your data collection stage. So do you need to improve your monitoring? Are you collecting all of the things that you need? Are you also logging it? Did you figure out that you're actually monitoring the things correctly, but you're not storing that data? Because monitoring and logging go hand in hand. There is no point in having active monitoring that you don't also save so that you can do this kind of analysis later. Likewise, there's no point in logging data that no one is ever looking at. So they are hand in hand. And that's a big piece of identifying the corrections, identifying what you are going to fix.

5:21 All right. So we've found our gaps in monitoring. We found our gaps in logging. We figured out what other corrections it is that we need to make. Now, what do we do now? We've got to implement the solution, right?
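The point that monitoring and logging go hand in hand can be made concrete with a few set comparisons. The sketch below is an assumption-laden illustration: the function name `find_gaps`, the category labels, and the signal names are invented here and do not come from the talk.

```python
from typing import Dict, List, Set


def find_gaps(needed: Set[str], monitored: Set[str],
              logged: Set[str], reviewed: Set[str]) -> Dict[str, List[str]]:
    """Compare what the RCA needed against what was watched, stored, and read."""
    return {
        # Signals the analysis needed but nothing was watching.
        "not_monitored": sorted(needed - monitored),
        # Signals that were watched live but never stored for later analysis.
        "monitored_not_logged": sorted(monitored - logged),
        # Signals that were stored but that nobody ever looks at.
        "logged_not_reviewed": sorted(logged - reviewed),
    }


# Hypothetical signal names, for illustration only.
gaps = find_gaps(
    needed={"breaker_state", "ups_load", "db_latency"},
    monitored={"ups_load", "db_latency"},
    logged={"db_latency"},
    reviewed=set(),
)
print(gaps)
```

Each non-empty list is a candidate correction of the kind the speaker describes: add monitoring, start storing what is monitored, or make sure someone reviews what is stored.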
Now is the implementation phase. So, implementation. We have to implement not just the short-term fix that we used to solve the outage that we had, but we also have to implement all of the long-term things, right: monitoring, logging, other corrections, software defects. Maybe we have a change management problem. You have to implement all of those fixes and you have to get them out there, because there's no point in doing the work to just let it sit, right? If you write up the RCA and you don't implement all of the changes that you need in order to make sure it doesn't happen again, again, it's a paperwork exercise. Not worth the time.

6:15 And then the last step, and this is the one that is often the hardest for everyone involved in any kind of a customer-impacting event, is communication. So I'm just going to put this up here as "comms". And I'm going to underline it, actually, I'm going to underline it twice. Communication is so important because once you have figured out what the problem is, you figured out how you're going to fix it, and you figured out what gaps and defects you have, you've implemented fixes for those gaps and defects, you have to keep your stakeholders apprised of what is going on. And it's hard for us to admit that things went wrong. It's hard for us to admit that there were failures that were our fault, that we have acknowledged and we're going to fix.

6:55 And so if we think about a company culture, comms around RCAs are so important because it is acknowledging to your customers, "yes, we're not perfect". We made a mistake, or the vendors that we selected had a problem, or really just about anything that could go wrong will eventually go wrong.
But we are reassuring you through our communication in this RCA process that we know that things happen and we are going to fix it and ensure that it never happens again.

7:26 This comms piece is the most important part. You can't just write up a two- or three-sentence, "Yes, something broke. We have identified fixes. We've implemented them and we'll make sure it doesn't happen again". You need to go one level further than that. You have to restore that trust and restore your customers' confidence in you, that you are acknowledging that you are not perfect and you're going to fix things so that they don't happen again.

7:52 And then you've got to keep that communication going. Once you've reached this implementation phase and you've actually got the fix rolled out, at that point keep up with your customers. If somebody had a CIE (customer-impacting event) six months ago, reach out to them. Be sure that they're still OK. Be sure that they have accepted the plan that you gave them on how it's never going to happen again and be sure that they are OK with it and that it's solving their needs.

8:17 So thank you so much. Hopefully this was helpful to you. If you have any questions or comments, please feel free to share them with us below. If you enjoyed this video and you would like to see more like it in the future, please do like the video and subscribe to us so that we'll know to keep creating for you.
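The implementation step described earlier in the transcript, where the short-term fix and every long-term correction all have to ship, can be sketched as a simple completeness check. This is a hypothetical illustration: the `Action` class, the `kind` labels, and the example actions are assumptions made here, not something the speaker specifies.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Action:
    """A corrective action recorded in the RCA write-up."""
    description: str
    kind: str          # "short_term" or "long_term" (labels assumed here)
    done: bool = False


def open_actions(actions: List[Action]) -> List[str]:
    """Return the descriptions of actions that have not been implemented."""
    return [a.description for a in actions if not a.done]


def rca_complete(actions: List[Action]) -> bool:
    """An RCA with no actions, or with open actions, is not complete."""
    return bool(actions) and not open_actions(actions)


# Hypothetical actions, for illustration only.
actions = [
    Action("manually reset breakers", "short_term", done=True),
    Action("store breaker state in logs", "long_term"),
    Action("revise maintenance schedule", "long_term"),
]

print(open_actions(actions))
print(rca_complete(actions))
```

Until `rca_complete` holds, the write-up is what the speaker calls a paperwork exercise: the outage was resolved, but nothing yet prevents it from happening again.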