CrowdStrike Rollout Failure Exposes Testing Flaws

6m • security • deep-dive • intermediate

Key Points

CrowdStrike’s recent massive outage was traced to fundamental procedural failures, including testing only in staging environments instead of production.
The rapid, simultaneous deployment lacked a rollback mechanism, turning the update into a “one‑way door” that left affected machines bricked and unable to receive OTA fixes.
No canary or phased rollout was performed, missing an opportunity to catch bugs in a small, live subset before global release.
The incident highlights a broader need for higher standards from CTOs and third‑party risk assessments, especially for Windows‑based infrastructure.
The situation presents an industry‑wide opportunity for Microsoft and the tech community to improve structured deployment testing and risk evaluation processes.

Sections

00:00:00 CrowdStrike Staging Test Oversight - The speaker criticizes CrowdStrike’s reliance on staging‑only testing that caused a major outage, argues CTOs must demand stricter deployment validation, and suggests Microsoft could supply ecosystem‑wide Windows testing tools.

Full Transcript

**Source:** [https://www.youtube.com/watch?v=404cxJdCinA](https://www.youtube.com/watch?v=404cxJdCinA) **Duration:** 00:06:48

0:01 So CrowdStrike's error correction document dropped. That means they're looking into the root cause of the massive CrowdStrike failure last week, and they're trying to understand what the root cause is and why. I want to talk about the fact that there are some absolutely glaring issues that they say they are going to correct, issues that should never, ever have been there in the first place. As I read into this more, I am recognizing that we need to expect chief technical officers at client companies to hold a higher standard when it comes to the kinds of software they're willing to deploy on their systems. So we're going to go into six related issues, and I think you're going to see what I mean.

0:42 Number one: CrowdStrike did not test on CrowdStrike production deployments. I just want to take a minute for that. They tested in a staging environment. This is for something that is going to roll out globally, and they just stopped at testing in staging. Apparently that was a normal thing, and they say they're going to fix that. They should never, ever have been testing that kind of a change only in staging.

1:14 I'll also add, and this is a little bit in defense of chief technical officers at clients, who obviously are super busy and have a lot on their plates: there's not really a structured way to evaluate these kinds of deployments on Windows machines. That's something Microsoft could take on as a fix for the ecosystem as a whole, since so much of our global infrastructure relies on Windows. How can we get better as a tech community at providing reliable third-party risk assessments of particular deploys? I don't have an answer for that, but it feels like an opportunity.

1:53 So the third thing I want to call out is that CrowdStrike deployed
fast. They deployed simultaneously. They did not have an option to roll back once there was a bug that bricked the machines or generated a blue screen of death, and so that was a one-way door. They were deploying in a one-way-door fashion on the assumption that either there would be no bug or they would always be able to roll back. That just isn't always a valid assumption, because certain kinds of bugs do exactly what we saw last week, where the machine is no longer updatable over the air. But CrowdStrike apparently never anticipated that kind of bug could exist, which is all the more astonishing because their CEO previously had a very similar bug happen at McAfee, back in, I think, 2010.

2:46 This is related to a fourth issue I want to call out: they had no canary testing and they had no phased rollout. That's the correct remedy, the best practice you could use. With a canary test, you're testing on a very small subsection of users in live environments, and this would have been caught. A phased rollout is a little bit more in this context, but at least you're rolling it out and not just assuming that it works for everybody at once by launching to the entire world. With a phased rollout, you're turning the dial up to half a percent of the total footprint, then 1% if it works, then 10%, so you roll it out gradually. With a canary test, you're actually just saying, "Here's 50 machines with CrowdStrike installed at actual customer locations; let's see what happens." I think both of those approaches would have caught this. I think canary is probably slightly stronger and would have minimized the impact of the footprint.

3:41 Regardless, none of it was tried. None of it was tried. It was just like, let's hit production, right? Like, let's go by you only live
once. It's absolutely astonishing given what they were changing. They should not have been able to write to the kernel and then change it like that. It's horrifying.

4:00 Okay, the last thing I want to call out, the sixth thing... oh, no, I guess it's the fifth thing. There will be a sixth thing; I'm losing track here. So the fifth thing I want to call out is that there's no control of deploys for clients. Typically, if it's this important to deploy, I would expect that a CTO would be able to have some governance over how deploys are enacted. They should be able to assess the risk to their systems and say, for example, "I do not want to be in the first tranche of deployment. I do not want to be on the front wave of deployment. This is a production system." Perhaps I'm the Delta CTO and this is managing all of my flight crew scheduling: "I do not want to have this computer updated over the air when I'm not looking at it. I want it updated after everything is validated, after production testing has occurred, at the end of your phased rollout, and I want to make sure it's updated in a way I can predict, so, for example, never push it on Fridays." None of that optionality was available to clients. That should be fixed.

5:00 The sixth thing, and now we really are at the sixth thing: I cannot believe that I have to say this out loud, but there was no release noting going on. They say they're going to add release notes. Everybody does release notes. I do not understand how they got to be a company this big, deploying fixes like this, and just decided not to do release notes. How? No idea. Apparently they're thinking about starting, which is a good thing to do.

5:34 I want to leave you with an Easter egg here at the end. Apparently, and we're learning this from clients and from partners of CrowdStrike, they decided that their apology to
maintain client relationships after this globally impacting bug, which grounded airlines and severely impacted hospitals and all the rest of it, is to send out, and I am not making this up, $10 Uber Eats gift cards. You might be pausing to laugh and facepalm there, but it gets worse. They're $10 Uber Eats gift cards, and clients and partners are saying they were affected by Uber Eats flagging these as fraudulent gift cards, so they can't even redeem the $10 gift cards they got for this globally impacting outage.

6:25 I'll just leave you with that. I guess the nugget is: always, always respect your clients and customers enough to think about what would happen if something you wrote actually caused a really severe bug for them. Otherwise you're going to end up in this boat. There's a cultural issue at CrowdStrike, and it needs to be addressed.
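
The canary test and phased rollout described in the transcript (50 machines first, then half a percent of the footprint, then 1%, then 10%) can be sketched roughly as below. This is a minimal illustration, not CrowdStrike's actual process: `phased_rollout`, `apply_update`, `is_healthy`, and `RolloutResult` are hypothetical names, and the final 100% stage is an assumption. The sketch halts the rollout instead of rolling back, since the speaker's point is that a bricked machine may not be reachable for a rollback at all.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Stage sizes in basis points of the fleet (50 = 0.5%, 100 = 1%, 1000 = 10%).
# The first three follow the transcript; the final 100% stage is an assumption.
STAGES_BP = (50, 100, 1000, 10_000)
CANARY_SIZE = 50  # "here's 50 machines" from the transcript


@dataclass
class RolloutResult:
    completed: bool      # True only if every stage passed its health check
    updated: List[str]   # hosts that received the update before any halt
    failed: List[str]    # unhealthy hosts found in the stage that halted


def phased_rollout(
    hosts: Sequence[str],
    apply_update: Callable[[str], None],   # hypothetical: pushes the update to one host
    is_healthy: Callable[[str], bool],     # hypothetical: post-update health probe
    canary_size: int = CANARY_SIZE,
    stages_bp: Sequence[int] = STAGES_BP,
) -> RolloutResult:
    """Update a canary group first, then widen stage by stage.

    The rollout stops at the first stage containing an unhealthy host,
    so a bad update never reaches the rest of the fleet.
    """
    total = len(hosts)
    # Cumulative host counts to reach: the canary group, then each stage.
    targets = [min(canary_size, total)]
    targets += [-(-total * bp // 10_000) for bp in stages_bp]  # ceiling division

    updated: List[str] = []
    for target in targets:
        if target <= len(updated):
            continue  # this stage is already covered by an earlier, larger one
        batch = list(hosts[len(updated):target])
        for host in batch:
            apply_update(host)
        updated.extend(batch)

        failed = [host for host in batch if not is_healthy(host)]
        if failed:
            return RolloutResult(False, updated, failed)  # halt; do not widen
    return RolloutResult(True, updated, [])
```

With a fleet of 10,000 hosts and a single bad host at position 75, the canary group of 50 passes, the 0.5% stage is already covered, and the rollout halts during the 1% stage after touching 100 hosts rather than all 10,000.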
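
The client-side controls the speaker asks for in the fifth issue (opting out of the first tranche, never updating on Fridays) could be expressed as a small policy check along these lines. Everything here is hypothetical: `ClientPolicy`, `min_stage`, `blocked_weekdays`, and `may_deploy` are illustrative names, not part of any CrowdStrike product or API.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet

FRIDAY = 4  # date.weekday() counts Monday as 0, so Friday is 4


@dataclass(frozen=True)
class ClientPolicy:
    """Hypothetical per-client deployment preferences."""
    min_stage: int = 0                               # earliest rollout stage accepted (0 = first wave)
    blocked_weekdays: FrozenSet[int] = frozenset()   # weekdays on which no update may be pushed


def may_deploy(policy: ClientPolicy, stage: int, day: date) -> bool:
    """Return True if the vendor may update this client at the given stage and date."""
    if stage < policy.min_stage:
        return False  # client opted out of the earlier tranches
    return day.weekday() not in policy.blocked_weekdays


# A cautious client: only the fourth stage or later, and never on Fridays.
cautious = ClientPolicy(min_stage=3, blocked_weekdays=frozenset({FRIDAY}))
```

Under this sketch, `may_deploy(cautious, 3, date(2024, 7, 19))` is refused because that date is a Friday, while the same stage on the following Monday, `date(2024, 7, 22)`, is allowed. Keeping the policy immutable and checking it on the vendor side before each push is one possible design; a real system would also need the client to be able to audit that the policy was honored.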