Avoiding Global BSOD Disasters
Key Points
- Vendors must perform extensive regression testing on a wide range of hardware and software configurations, not just a single “happy path,” to ensure new releases don’t break existing functionality.
- The operating system kernel should be altered as little as possible; any changes to this core layer carry high risk of catastrophic failures like system crashes.
- When software requires kernel‑level privileges, developers need rigorous safeguards and validation because a fault in that layer can bring down the entire system.
- Users and clients should verify that vendor updates have been properly vetted in environments similar to their own and avoid installing untested patches that could affect kernel stability.
- A shared preventive strategy—such as standardized, automated testing pipelines and strict change‑control policies for kernel modifications—helps both vendors and users keep large fleets of workstations from experiencing simultaneous blue‑screen failures.
Full Transcript

**Source:** [https://www.youtube.com/watch?v=88PLU7Xt0oA](https://www.youtube.com/watch?v=88PLU7Xt0oA)
**Duration:** 00:11:11

Sections

- [00:00:00](https://www.youtube.com/watch?v=88PLU7Xt0oA&t=0s) **Testing Software to Avoid Global Crashes** - The speaker emphasizes comprehensive regression testing across diverse system configurations as essential for vendors and users to prevent widespread failures like massive blue-screen incidents.
you reboot your system and you get the
dreaded blue screen of death oh no now
you are having a bad day but what if 8
million other workstations do the same
thing now the whole planet is having a
bad day well there's an old saying that
says never let a good crisis go to waste
so let's see what we can learn from this
what kind of nuggets can we mine from
the situation so that we make sure this
doesn't happen again I'm going to go
through Lessons Learned for vendors who
are producing software
and Lessons Learned for users and
clients of those vendors and then one
bonus topic that both can use in order
to prevent this kind of thing from
happening again in the future what's the
first thing that a vendor could do to
make sure this catastrophe doesn't
happen well obviously you need to test
your software and I don't mean just test
it and make sure it's generally working
we ran it on a system and now
everything's fine no you need to run it
with a full regression test where you're
going to go back and make sure that all
the old functions that used to work
didn't get broken and that you haven't
introduced any new errors and one of the
really important parts of this is to
make sure your software is going to run
properly on a lot of different
environments a lot of different kinds of
systems so don't just check it on one
type of system one configuration and say
we're good to go in fact you need to
test it on a lot of different systems
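That breadth of configurations is exactly what a parameterized test matrix automates. A minimal sketch (all names here are hypothetical placeholders, not a real vendor's suite):

```python
import itertools

# Hypothetical configuration matrix: every OS/driver/architecture
# combination we claim to support gets exercised, not just one
# "happy path" machine.
OSES = ["win10", "win11", "server2019"]
DRIVERS = ["storage-a", "storage-b"]
ARCHES = ["x86_64", "arm64"]

def run_regression_suite(os_name, driver, arch):
    """Placeholder for the real regression suite on one configuration."""
    return True  # a real suite would return pass/fail per configuration

results = {
    combo: run_regression_suite(*combo)
    for combo in itertools.product(OSES, DRIVERS, ARCHES)
}
failed = [combo for combo, ok in results.items() if not ok]
print(f"{len(results)} configurations tested, {len(failed)} failures")
```

In practice each combination would run on real or virtualized hardware in CI, but the principle is the same: the matrix, not a single machine, is the unit of "it works".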
because it may work perfectly well on
those and not on this so you need a
variety of systems everything that
someone could possibly run it on you
want to test against so that's the first
one second one you want to minimize
modifications to the kernel now what do
I mean by that well let's take a look
over here so if you think about this is
memory on a system that's the ram here's
the disk drive the storage that we're
going to have on it uh this is the CPU
this is the network interface card so it
can go out and talk to other systems
these are just some examples of things
that are interfaces that the operating
system is going to have to work with so
it needs to be able to mediate
communication amongst all those
different things well there is a core of
the operating system that we call the
kernel and the kernel is the thing that
is really as I said the core if it gets
compromised everything else falls apart
so it's particularly important and by
the way up here is where our apps are so
apps are running on the OS the OS has
this core and everything else runs
around that
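This layering is the whole point: a failure in an upper layer can be contained, while a failure in the shared core cannot. A toy illustration of containment (pure sketch, nothing OS-specific):

```python
import subprocess
import sys

# Toy illustration of the layering above: a fault in an isolated
# child process (like an app running above the OS) leaves the parent
# running, whereas a fault in the shared core takes everything down.
# The child simulates a hard crash by exiting with a failure code.
result = subprocess.run(
    [sys.executable, "-c", "import sys; sys.exit(139)"],
    capture_output=True,
)
print("child exit code:", result.returncode)  # the crash was contained
print("parent still running")                 # the 'system' survived
```

A kernel has no such luxury: there is no surviving parent layer to catch the fault, which is why the transcript argues for hooking in as high up the stack as possible.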
in general we want not to be modifying
this kernel in any way because that's
the core of the system you make a
mistake there and you see what the the
consequences can be really dramatic but
in some cases there is software where
it's given kernel privileges where it's
able to hook in the kernel and make
certain changes in the way basically the
operating system is working we got to be
really really careful whenever we do
that because if this thing fails then
guess what so does this thing and
then the whole system goes down what
we'd rather do is in fact have our hooks
here at the OS level or even better
where it's possible have them up here
that way the OS continues operating even
when there's been a catastrophic failure
of some sort into whatever this
particular feature is and this one I
think is maybe the most important for a
vendor that's rolling out software if
your software is going to be part of the
critical path and clearly something that
is hooking into the kernel is going to be
part of the critical path because
everything's going to depend on it and
that is you want to do a phased roll
out that is in this case if I've got a
software update and I want to roll this
out to the whole world I'm not doing it
all at once here's what I'm going to do
especially if I'm part of the critical
path and especially if this software is
running critical infrastructure I'm
going to roll out to let's say the first
five maybe 10% of my user population and then
sit back and wait and make sure
everything's okay now maybe if this is a
really critical update it needs to go
out really quickly so we only wait a few
hours if we can afford to wait a few
days it's even better still that way if
we find out that this system goes upside
down well then I know I can stop the
roll out and I don't affect everyone
else then after I have an idea I'm
staging or phasing this roll out and I
roll out another 10% another 20% another
50% as I get more and more confidence so
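The staged rollout just described might be driven by something like the following sketch (wave sizes and the health check are hypothetical; real telemetry queries would replace the placeholder):

```python
# Hypothetical phased-rollout driver: push the update to a small wave,
# verify health telemetry, and only then widen to the next wave.
WAVES = [0.05, 0.10, 0.20, 0.50, 1.00]  # cumulative fraction of the fleet

def wave_is_healthy():
    """Placeholder: real code would query crash/blue-screen telemetry."""
    return True

def phased_rollout(fleet_size):
    deployed = 0
    for fraction in WAVES:
        deployed = int(fleet_size * fraction)
        # In practice we'd wait hours (or days, if we can afford it)
        # between waves before checking health.
        if not wave_is_healthy():
            print(f"halting rollout at {deployed}/{fleet_size} machines")
            return deployed
        print(f"wave ok: {deployed}/{fleet_size} machines updated")
    return deployed

final = phased_rollout(8_000_000)
```

The key property is the early exit: if the 5% wave goes upside down, the other 95% of the fleet never receives the update.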
those are the kinds of things that would
make a huge difference if vendors would
do this especially if your software is
part of the critical path okay now we've
seen what a vendor could do in order to
prevent this kind of disaster from
happening but what could a user or a
client of that vendor do in order to
protect themselves well the first most
important thing business continuity
Disaster Recovery planning everyone
needs to do this you need to assume
everything will fail look at every part
of the system and imagine what happens
if that thing doesn't work and then
figure out how would we recover from
that so that's the basic planning part
of it and one of the most important
parts of that is having good backups and
the ability to restore from those
backups so for instance here's my system
operating and if it goes down then I
need an ability to very quickly recover
that system back to the state that it
was in when I did the last back up or if
I have lots of these systems maybe what
I need is a master image that I can then
restore across all of these very quickly
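Re-imaging a whole fleet is itself a mass change, so it deserves a sanity check of its own. One minimal sketch (file names are made up for illustration) is verifying the master image's checksum before pushing it anywhere:

```python
import hashlib
import tempfile
from pathlib import Path

# Hypothetical pre-restore check: hash the master image once when it
# is built, then verify the copy on each target before re-imaging, so
# a corrupted image is never pushed across the whole fleet.
def sha256_of(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    image = Path(tmp) / "master.img"
    image.write_bytes(b"stand-in for a real master image")
    expected = sha256_of(image)          # recorded when the image is built
    assert sha256_of(image) == expected  # re-checked on each target
    print("image verified, safe to restore")
```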
and in some cases I'll even need to do
what's referred to as a bare metal
restore that is assume the operating
system is not even on this system as if
it was just brand new bare metal out of
the box and I need to be able to push
out an image that will work for all of
those that will particularly make sense
in cases where everyone uses essentially
the same software configuration so that
has to be in place and we need to test
those plans and make sure that we can
recover quickly when it occurs because
it will if you wait long enough
eventually it occurs second there's kind
of a debate about do I patch or do I
wait should I put the patch out right
away or do I wait a bit think of it this
way here's a timeline so we've got some
point in time when a bug is discovered
and then it takes the vendor a little
bit of time and then eventually they
come out with a patch some sort of
software update that we're going to put
on the systems now the question is for
you when do you apply that well this
interval of time becomes very important
and there's an interesting debate that
can be had about how long that interval
of time should be because if you apply
all the patches instantly and there's a
problem with it then you can see exactly
what happens that's what happened
in the case of these in some cases
you're not given the opportunity to even
make this decision it's automatically
pushed to your systems in general
automatic software updates are a good
idea unless again it's part of the
critical infrastructure and the vendor
didn't follow a lot of those kinds of
principles then it could come back to
bite you which is why some organizations
say I don't want to apply new software
updates right away I want to let it kind
of curate a little bit and make sure
that nobody else has problems with it
then we'll put it on there however that
makes sense from a stability standpoint
from a security standpoint it means your
interval of exposure is greater so that
if this happened to be a security bug
that just came out and now this
is the patch the longer you wait to
apply it the longer you're exposed and
the more chances the bad guys have to
come and attack you so there's a
tradeoff in this and you got to be
careful in terms of what decision you
make another one that you need to
consider is I've talked on this channel
a lot about what I call the KISS
principle keep it simple stupid how
simple do you want to make it well I've
said before that complexity is the enemy
of security so we'd like to keep the
system simple if we can
and one approach to that is with a
monoculture of software in that case
every single one of the systems that are
out there are configured exactly the
same
it's obviously easier if we find a bug
in one system we know it's going to be
the same bug in all of them and
therefore we can go fix them all the
same way and the management of that is a
lot simpler and therefore it's easier to
assure that we're going to get the right
results however uh an argument can be
made that we want a certain amount of
diversity in the software that we don't
want every single system to be exactly
the same because if it is then any
single bug that affects one system will
take them all down so there might be
some value in having a little bit of
diversity in your environment but you're
going to have to balance that with also
trying to keep it simple so there are
tradeoffs on the side of the user and
hopefully you'll make the right decision
on
this okay now we've got three things
that vendors can do and three things
that the user Community can do in order
to avoid these kind of software outages
on a global scale how about one more
that both of them can benefit from and
I'm going to call that root cause
analysis in this case what we're going
to do is look for what was the
underlying problem that allowed this to
happen in the first place so we're going
to do something in the military I think
they refer to it as an after action
review old programmers like me used
to refer to this at the end of a project
we would do what was called a postmortem
analysis so we're going to go back and
look it's a forensic analysis of the
case with a specific eye towards looking
at not only what was the defect but what
caused that defect to allow it to happen
in the first place what introduced that
and why did we not catch it and more
importantly we're going to look at doing
defect Extinction in defect Extinction
I'm going to try to put a process in
place that allows that never to happen
again we're going to catch that so that
this is never something we deal with
again and we'd like as much as possible to
use tools to use other automated
processes and things like that just
saying we're all going to try harder or
we're going to educate the users ah that
that only works so far better still if I
can make it so it's not even possible
for that to ever occur again then we
benefit going forward and we never see
that problem again so there you have it
three things for vendors three things
for users and one bonus topic for both
of them that's just a short list of
some of the things that could go into
avoiding a catastrophe like this in the
future I bet you can think of some
others so in fact put those in the
comments below and everyone else can
learn from your ideas