CrowdStrike Rollout Failure Exposes Testing Flaws

Key Points

  • CrowdStrike’s recent massive outage was traced to fundamental procedural failures, including testing only in staging environments instead of production.
  • The rapid, simultaneous deployment lacked a rollback mechanism, turning the update into a “one‑way door” that left affected machines bricked and unable to receive OTA fixes.
  • No canary or phased rollout was performed, missing an opportunity to catch bugs in a small, live subset before global release.
  • The incident highlights a broader need for higher standards from CTOs and third‑party risk assessments, especially for Windows‑based infrastructure.
  • The situation presents an industry‑wide opportunity for Microsoft and the tech community to improve structured deployment testing and risk evaluation processes.
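
The canary and phased-rollout points above can be made concrete with a small sketch. It is a minimal illustration, not CrowdStrike's process: the stage sizes of 0.5%, 1%, and 10% follow the figures the speaker gives in the transcript, while the final 100% stage, the `phased_rollout` function, the `RolloutResult` type, and the `is_healthy` callback are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

# Cumulative share of the fleet per stage. 0.5% -> 1% -> 10% follows the talk;
# the final 100% stage is an assumption added to complete the rollout.
STAGES = (0.005, 0.01, 0.10, 1.0)


@dataclass
class RolloutResult:
    completed: bool              # True only if every stage passed its health check
    updated: List[str]           # hosts that received the update before any halt
    halted_at: Optional[float]   # stage fraction where the rollout stopped, if any


def phased_rollout(
    fleet: Sequence[str],
    is_healthy: Callable[[str], bool],
    stages: Sequence[float] = STAGES,
) -> RolloutResult:
    """Update the fleet in growing stages, halting at the first unhealthy batch."""
    updated: List[str] = []
    for fraction in stages:
        # Cumulative number of hosts that should be updated after this stage.
        target = max(1, round(len(fleet) * fraction))
        batch = list(fleet[len(updated):target])
        updated.extend(batch)  # stand-in for actually pushing the update
        # Check the batch before widening the rollout. A failure here stops
        # the rollout, so the damage is limited to hosts updated so far.
        if not all(is_healthy(host) for host in batch):
            return RolloutResult(False, updated, fraction)
    return RolloutResult(True, updated, None)
```

With a fleet of 1,000 hosts and an update that breaks every machine, this sketch stops after the first five hosts instead of reaching all 1,000. It only limits the blast radius. It does not roll anything back, which matters because the failure described here left machines unable to receive a fix over the air.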

**Source:** [https://www.youtube.com/watch?v=404cxJdCinA](https://www.youtube.com/watch?v=404cxJdCinA)
**Duration:** 00:06:48

## Sections

- [00:00:00](https://www.youtube.com/watch?v=404cxJdCinA&t=0s) **CrowdStrike Staging Test Oversight** - The speaker criticizes CrowdStrike’s reliance on staging‑only testing that caused a major outage, argues CTOs must demand stricter deployment validation, and suggests Microsoft could supply ecosystem‑wide Windows testing tools.

## Full Transcript
[0:01] So CrowdStrike's error correction document dropped. That means they're looking into the root cause of the massive CrowdStrike failure last week and trying to understand what happened and why. I want to talk about the fact that there are some absolutely glaring issues that they say they are going to correct and that should never, ever have been there in the first place. As I read into this more, I am recognizing that we need to expect chief technical officers at client companies to hold a higher standard when it comes to the kinds of software they're willing to deploy on their systems. We're going to go into six related issues, and I think you're going to see what I mean.

[0:42] Number one: CrowdStrike did not test on CrowdStrike production deployments. I just want to take a minute for that. They tested in a staging environment. This is for something that is going to roll out globally, and they just stopped at testing in staging. Apparently that was a normal thing, and they say they're going to fix that. They should never, ever have been testing that kind of change only in staging.

[1:14] I'll also add, and this is a little bit in defense of chief technical officers at clients, who obviously are super busy and have a lot on their plates: there's not really a structured way to evaluate these kinds of deployments on Windows machines. That's something Microsoft could take on as a fix for the ecosystem as a whole, since so much of our global infrastructure relies on Windows. How can we get better as a tech community at providing reliable third-party risk assessments of particular deploys? I don't have an answer for that, but it feels like an opportunity.

[1:53] The third thing I want to call out is that CrowdStrike deployed fast. They deployed simultaneously. They did not have an option to roll back once there was a bug that bricked the machines or generated a blue screen of death, so that was a one-way door. They were deploying in a one-way-door fashion on the assumption that either there would be no bug or they would always be able to roll back. That just isn't always a valid assumption, because certain kinds of bugs do exactly what we saw last week, where the machine is no longer updatable over the air. But CrowdStrike apparently never anticipated that kind of bug could exist, which is all the more astonishing because their CEO previously had a very similar bug happen at McAfee. This was back in, I think, 2010.

[2:46] This is related to a fourth issue I want to call out: they had no canary testing and they had no phased rollout. That's the correct remedy; that's the best practice. With a canary test, you're testing on a very small subsection of users in live environments. This would have been caught. A phased rollout is a little bit more in this context, but at least you're rolling it out and not just assuming that it works for everybody at once by launching to the entire world. With a phased rollout, you're turning the dial up to half a percent of the total footprint, then 1% if it works, then 10%. With a canary test, you're actually just saying, "Here are 50 machines with CrowdStrike installed at actual customer locations; let's see what happens." I think both of those approaches would have caught this. I think canary is probably slightly stronger and would have minimized the footprint of the impact. Regardless, none of it was tried. It was just, "Let's hit production," right? Like, let's go by "you only live once." It's absolutely astonishing given what they were changing. They should not have been able to write to the kernel and then change it like that. It's horrifying.

[4:00] Okay, the last thing I want to call out, the sixth thing... oh, I guess it's the fifth thing. There will be a sixth thing; I'm losing track here. The fifth thing I want to call out is that there's no control of deploys for clients. Typically, if it's this important to deploy, I would expect that a CTO would have some governance over how deploys are enacted. They should be able to assess the risk to their systems and say, for example: "I do not want to be in the first tranche of deployment. I do not want to be on the front wave of deployment. This is a production system. Perhaps I'm the Delta CTO, and this is managing all of my flight crew scheduling. I do not want this computer updated over the air when I'm not looking at it. I want it updated after everything is validated, after production testing has occurred, at the end of your phased rollout, and I want to make sure it's updated in a way I can predict. So, for example, never push it on Fridays." None of that optionality was available to clients. That should be fixed.

[5:00] The sixth thing, now we're at the sixth thing, and I cannot believe that I have to say this out loud: there was no release noting going on. They say they're going to add release notes. Everybody does release notes. I do not understand how they got to be a company this big, deploying fixes like this, and just decided not to do release notes. How? No idea. Apparently they're thinking about starting, which is a good thing to do.

[5:34] I want to leave you with an Easter egg here at the end. Apparently, and we're learning this from clients and partners of CrowdStrike, they decided that their apology to maintain client relationships after this globally impacting bug that grounded airlines and severely impacted hospitals and all the rest of it is to send out, and I am not making this up, $10 Uber Eats gift cards. You might be pausing to laugh and facepalm there. It gets worse. They're $10 Uber Eats gift cards, and clients and partners are saying they were impacted by Uber Eats flagging these as fraudulent gift cards, so they can't even redeem the $10 gift cards they got for this globally impacting outage.

[6:25] I'll just leave you with that. I guess the nugget is: always, always respect your clients and customers enough to think about what would happen if something you wrote actually caused a really severe bug for them. Otherwise you're going to end up in this boat. There's a cultural issue at CrowdStrike, and it needs to be addressed.
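
The client-side control described in the fifth point could be modeled roughly as a policy check that the vendor consults before updating a given customer. This is a sketch under stated assumptions: `ClientPolicy`, `may_deploy`, and every field name below are hypothetical and do not correspond to any real CrowdStrike setting. The three rules (skip early tranches, wait for validation, never push on Fridays) are taken from the speaker's examples.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet


@dataclass(frozen=True)
class ClientPolicy:
    """Hypothetical per-client deployment preferences."""
    earliest_wave: int = 1            # 1 = first tranche; higher = later waves only
    require_validated: bool = False   # wait until production testing has finished
    # Weekdays on which updates are refused, using date.weekday() (Monday = 0).
    blocked_weekdays: FrozenSet[int] = frozenset()


def may_deploy(policy: ClientPolicy, wave: int, validated: bool, today: date) -> bool:
    """Return True only if this update wave satisfies the client's policy."""
    if wave < policy.earliest_wave:
        return False  # client opted out of the front waves of deployment
    if policy.require_validated and not validated:
        return False  # client wants production validation to finish first
    if today.weekday() in policy.blocked_weekdays:
        return False  # for example, never push on Fridays (weekday 4)
    return True


# A cautious client running a production scheduling system might choose:
cautious = ClientPolicy(
    earliest_wave=3,
    require_validated=True,
    blocked_weekdays=frozenset({4}),
)
```

Under this policy, `may_deploy(cautious, wave=1, validated=True, today=...)` is refused regardless of the day, and even a validated third-wave update is refused when `today` falls on a Friday. The design choice is that the client, not the vendor, owns the decision, which is the governance the speaker says was missing.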