Avoiding Global BSOD Disasters
Key Points
- Vendors must perform extensive regression testing on a wide range of hardware and software configurations, not just a single “happy path,” to ensure new releases don’t break existing functionality.
- The operating system kernel should be altered as little as possible; any changes to this core layer carry high risk of catastrophic failures like system crashes.
- When software requires kernel‑level privileges, developers need rigorous safeguards and validation because a fault in that layer can bring down the entire system.
- Users and clients should verify that vendor updates have been properly vetted in environments similar to their own and avoid installing untested patches that could affect kernel stability.
- A shared preventive strategy—such as standardized, automated testing pipelines and strict change‑control policies for kernel modifications—helps both vendors and users keep large fleets of workstations from experiencing simultaneous blue‑screen failures.
Full Transcript

**Source:** [https://www.youtube.com/watch?v=88PLU7Xt0oA](https://www.youtube.com/watch?v=88PLU7Xt0oA)
**Duration:** 00:11:11

Sections

- [00:00:00](https://www.youtube.com/watch?v=88PLU7Xt0oA&t=0s) **Testing Software to Avoid Global Crashes** - The speaker emphasizes comprehensive regression testing across diverse system configurations as essential for vendors and users to prevent widespread failures like massive blue-screen incidents.
you reboot your system and you get the
dreaded blue screen of death oh no now
you are having a bad day but what if 8
million other workstations do the same
thing now the whole planet is having a
bad day well there's an old saying that
says never let a good crisis go to waste
so let's see what we can learn from this
what kind of nuggets can we mine from
the situation so that we make sure this
doesn't happen again I'm going to go
through Lessons Learned for vendors who
are producing software
and Lessons Learned for users and
clients of those vendors and then one
bonus topic that both can use in order
to prevent this kind of thing from
happening again in the future what's the
first thing that a vendor could do to
make sure this catastrophe doesn't
happen well obviously you need to test
your software and I don't mean just test
it and make sure it's generally working
we ran it on a system and now
everything's fine no you need to run it
with a full regression test where you're
going to go back and make sure that all
the old functions that used to work
didn't get broken and that you haven't
introduced any new errors and one of the
really important parts of this is to
make sure your software is going to run
properly on a lot of different
environments a lot of different kinds of
systems so don't just check it on one
type of system one configuration and say
we're good to go in fact you need to
test it on a lot of different systems
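That breadth of configurations is exactly what a parameterized test matrix automates. A minimal sketch (all names here are hypothetical placeholders, not a real vendor's suite):

```python
import itertools

# Hypothetical configuration matrix: every OS/driver/architecture
# combination we claim to support gets exercised, not just one
# "happy path" machine.
OSES = ["win10", "win11", "server2019"]
DRIVERS = ["storage-a", "storage-b"]
ARCHES = ["x86_64", "arm64"]

def run_regression_suite(os_name, driver, arch):
    """Placeholder for the real regression suite on one configuration."""
    return True  # a real suite would return pass/fail per configuration

results = {
    combo: run_regression_suite(*combo)
    for combo in itertools.product(OSES, DRIVERS, ARCHES)
}
failed = [combo for combo, ok in results.items() if not ok]
print(f"{len(results)} configurations tested, {len(failed)} failures")
```

In practice each combination would run on real or virtualized hardware in CI, but the principle is the same: the matrix, not a single machine, is the unit of "it works".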
because it may work perfectly well on
those and not on this so you need a
variety of systems everything that
someone could possibly run it on you
want to test against so that's the first
one second one you want to minimize
modifications to the kernel now what do
I mean by that well let's take a look
over here so if you think about this is
memory on a system that's the ram here's
the disk drive the storage that we're
going to have on it uh this is the CPU
this is the network interface card so it
can go out and talk to other systems
these are just some examples of things
that are interfaces that the operating
system is going to have to work with so
it needs to be able to mediate
communication amongst all those
different things well there is a core of
the operating system that we call the
kernel and the kernel is the thing that
is really as I said the core if it gets
compromised everything else falls apart
so it's particularly important and by
the way up here is where our apps are so
apps are running on the OS the OS has
this core and everything else runs
around that
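This layering is the whole point: a failure in an upper layer can be contained, while a failure in the shared core cannot. A toy illustration of containment (pure sketch, nothing OS-specific):

```python
import subprocess
import sys

# Toy illustration of the layering above: a fault in an isolated
# child process (like an app running above the OS) leaves the parent
# running, whereas a fault in the shared core takes everything down.
# The child simulates a hard crash by exiting with a failure code.
result = subprocess.run(
    [sys.executable, "-c", "import sys; sys.exit(139)"],
    capture_output=True,
)
print("child exit code:", result.returncode)  # the crash was contained
print("parent still running")                 # the 'system' survived
```

A kernel has no such luxury: there is no surviving parent layer to catch the fault, which is why the transcript argues for hooking in as high up the stack as possible.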
in general we want not to be modifying
this kernel in any way because that's
the core of the system you make a
mistake there and you see what the the
consequences can be really dramatic but
in some cases there is software where
it's given kernel privileges where it's
able to hook in the kernel and make
certain changes in the way basically the
operating system is working we got to be
really really careful whenever we do
that because if this thing fails then
guess what so does this thing and
then the whole system goes down what
we'd rather do is in fact have our hooks
here at the OS level or even better
where it's possible have them up here
that way the OS continues operating even
when there's been a catastrophic failure
of some sort into whatever this
particular feature is and this one I
think is maybe the most important for a
vendor that's rolling out software if
your software is going to be part of the
critical path and clearly something that
is hooking into the kernel is going to be
part of the critical path because
everything's going to depend on it and
that is you want to do a phased roll
out that is in this case if I've got a
software update and I want to roll this
out to the whole world I'm not doing it
all at once here's what I'm going to do
especially if I'm part of the critical
path and especially if this software is
running critical infrastructure I'm
going to roll out to let's say the first
five maybe 10% of my user population and then
sit back and wait and make sure
everything's okay now maybe if this is a
really critical update it needs to go
out really quickly so we only wait a few
hours if we can afford to wait a few
days it's even better still that way if
we find out that this system goes upside
down well then I know I can stop the
roll out and I don't affect everyone
else then after I have an idea I'm
staging or phasing this roll out and I
roll out another 10% another 20% another
50% as I get more and more confidence so
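The staged rollout just described might be driven by something like the following sketch (wave sizes and the health check are hypothetical; real telemetry queries would replace the placeholder):

```python
# Hypothetical phased-rollout driver: push the update to a small wave,
# verify health telemetry, and only then widen to the next wave.
WAVES = [0.05, 0.10, 0.20, 0.50, 1.00]  # cumulative fraction of the fleet

def wave_is_healthy():
    """Placeholder: real code would query crash/blue-screen telemetry."""
    return True

def phased_rollout(fleet_size):
    deployed = 0
    for fraction in WAVES:
        deployed = int(fleet_size * fraction)
        # In practice we'd wait hours (or days, if we can afford it)
        # between waves before checking health.
        if not wave_is_healthy():
            print(f"halting rollout at {deployed}/{fleet_size} machines")
            return deployed
        print(f"wave ok: {deployed}/{fleet_size} machines updated")
    return deployed

final = phased_rollout(8_000_000)
```

The key property is the early exit: if the 5% wave goes upside down, the other 95% of the fleet never receives the update.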
those are the kinds of things that would
make a huge difference if vendors would
do this especially if your software is
part of the critical path okay now we've
seen what a vendor could do in order to
prevent this kind of disaster from
happening but what could a user or a
client of that vendor do in order to
protect themselves well the first most
important thing business continuity
Disaster Recovery planning everyone
needs to do this you need to assume
everything will fail look at every part
of the system and imagine what happens
if that thing doesn't work and then
figure out how would we recover from
that so that's the basic planning part
of it and one of the most important
parts of that is having good backups and
the ability to restore from those
backups so for instance here's my system
operating and if it goes down then I
need an ability to very quickly recover
that system back to the state that it
was in when I did the last back up or if
I have lots of these systems maybe what
I need is a master image that I can then
restore across all of these very quickly
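Re-imaging a whole fleet is itself a mass change, so it deserves a sanity check of its own. One minimal sketch (file names are made up for illustration) is verifying the master image's checksum before pushing it anywhere:

```python
import hashlib
import tempfile
from pathlib import Path

# Hypothetical pre-restore check: hash the master image once when it
# is built, then verify the copy on each target before re-imaging, so
# a corrupted image is never pushed across the whole fleet.
def sha256_of(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    image = Path(tmp) / "master.img"
    image.write_bytes(b"stand-in for a real master image")
    expected = sha256_of(image)          # recorded when the image is built
    assert sha256_of(image) == expected  # re-checked on each target
    print("image verified, safe to restore")
```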
and in some cases I'll even need to do
what's referred to as a bare metal
restore that is assume the operating
system is not even on this system as if
it was just brand new bare metal out of
the box and I need to be able to push
out an image that will work for all of
those that will particularly make sense
in cases where everyone uses essentially
the same software configuration so that
has to be in place and we need to test
those plans and make sure that we can
recover quickly when it occurs because
it will if you wait long enough
eventually it occurs second there's kind
of a debate about do I patch or do I
wait should I put the patch out right
away or do I wait a bit think of it this
way here's a timeline so we've got some
point in time when a bug is discovered
and then it takes the vendor a little
bit of time and then eventually they
come out with a patch some sort of
software update that we're going to put
on the systems now the question is for
you when do you apply that well this
interval of time becomes very important
and there's an interesting debate that
can be had about how long that interval
of time should be because if you apply
all the patches instantly and there's a
problem with it then you can see exactly
what happens that's what happened
in the case of these in some cases
you're not given the opportunity to even
make this decision it's automatically
pushed to your systems in general
automatic software updates are a good
idea unless again it's part of the
critical infrastructure and the vendor
didn't follow a lot of those kinds of
principles then it could come back to
bite you which is why some organizations
say I don't want to apply new software
updates right away I want to let it kind
of curate a little bit and make sure
that nobody else has problems with it
then we'll put it on there however that
makes sense from a stability standpoint
from a security standpoint it means your
interval of exposure is greater so that
if this happened to be a security bug
that just came out and now this
is the patch the longer you wait to
apply it the longer you're exposed and
the more chances the bad guys have to
come and attack you so there's a
tradeoff in this and you got to be
careful in terms of what decision you
make another one that you need to
consider is I've talked on this channel
a lot about what I call the KISS
principle keep it simple stupid how
simple do you want to make it well I've
said before that complexity is the enemy
of security so we'd like to keep the
system simple if we can
and one approach to that is with a
monoculture of software in that case
every single one of the systems that are
out there are configured exactly the
same
it's obviously easier if we find a bug
in one system we know it's going to be
the same bug in all of them and
therefore we can go fix them all the
same way and the management of that is a
lot simpler and therefore it's easier to
assure that we're going to get the right
results however uh an argument can be
made that we want a certain amount of
diversity in the software that we don't
want every single system to be exactly
the same because if it is then any
single bug that affects one system will
take them all down so there might be
some value in having a little bit of
diversity in your environment but you're
going to have to balance that with also
trying to keep it simple so there are
tradeoffs on the side of the user and
hopefully you'll make the right decision
on
this okay now we've got three things
that vendors can do and three things
that the user Community can do in order
to avoid these kind of software outages
on a global scale how about one more
that both of them can benefit from and
I'm going to call that root cause
analysis in this case what we're going
to do is look for what was the
underlying problem that allowed this to
happen in the first place so we're going
to do something in the military I think
they refer to it as an after action
review old programmers like me used
to refer to this at the end of a project
we would do what was called a postmortem
analysis so we're going to go back and
look it's a forensic analysis of the
case with a specific eye towards looking
at not only what was the defect but what
caused that defect to allow it to happen
in the first place what introduced that
and why did we not catch it and more
importantly we're going to look at doing
defect Extinction in defect Extinction
I'm going to try to put a process in
place that allows that never to happen
again we're going to catch that so that
this is never something we deal with
again and we'd like as much as possible to
use tools to use other automated
processes and things like that just
saying we're all going to try harder or
we're going to educate the users ah that
that only works so far better still if I
can make it so it's not even possible
for that to ever occur again then we
benefit going forward and we never see
that problem again so there you have it
three things for vendors three things
for users and one bonus topic for both
of them that's just a short list of
some of the things that could go into
avoiding a catastrophe like this in the
future I bet you can think of some
others so in fact put those in the
comments below and everyone else can
learn from your ideas