Learning Library

Accidental Production Deletion: Lessons Learned

Key Points

  • A careless “rm -f” run as root on the wrong terminal deleted the production server’s home directory, causing the system to go down.
  • Running redundant (mirrored) production servers allowed the faulty server to be taken out of rotation and the service restored quickly from the untouched replicas; the speaker links this to blue/green-style staged rollouts that start with a subset of servers.
  • Sharing root passwords and giving developers unrestricted admin access creates dangerous shortcuts; instead, access should be scoped to specific groups or roles.
  • Implementing privileged‑access management provides granular, auditable permissions, eliminating the need for direct root logins while still allowing necessary high‑level tasks.

Full Transcript

**Source:** [https://www.youtube.com/watch?v=fR-nfp1DiTs](https://www.youtube.com/watch?v=fR-nfp1DiTs)
**Duration:** 00:04:34

## Sections

- [00:00:00](https://www.youtube.com/watch?v=fR-nfp1DiTs&t=0s) **Root Deletion Mistake on Production** - A developer inadvertently ran a dangerous “rm -f” command in the production environment after switching to root, wiping the home directory and causing an outage, underscoring the importance of clear environment separation, verification, and redundant mirrors.
- [00:03:13](https://www.youtube.com/watch?v=fR-nfp1DiTs&t=193s) **Secure Deployment Automation Practices** - The speaker advises using privileged access management and limiting sudo privileges while fully automating deployments with scripts that handle authentication, testing, and rollback to avoid costly mistakes and embarrassment.

0:00 Welcome to "Lessons Learned," a series where we share our biggest mistakes so you don't make the same ones.
0:05 Today's lesson comes from a time when I was a web developer deploying a fix to a production system.
0:11 So I was working on a fix on my laptop and I uploaded it to a test system.
0:18 And I verify that it works as expected.
0:21 So then, following a runbook, I upload it to the production system.
0:27 And again, I verify that it works as expected. So now I'm ready to move on to my next task and I want to clean up.
0:33 So I do what I think is a pretty reasonable command: "rm -f" to remove my test files that I've been using.
0:42 And inexplicably I get a warning that says "you didn't have permission to do that".
0:47 Hmm.
0:48 Okay.
0:49 Well, this is where I made my big mistake.
0:51 I switch to the root user and go ahead and delete anyway with the same command.
0:56 Well, moments later, the production system goes down.
1:00 What I had done is, I had clicked on the production window when I really thought that I was on the test system.
1:10 And I thought that I was in some distant directory, when in fact I was in the home directory of the production system.
1:17 So I deleted that and, of course, the system went down.
1:21 The good news, fortunately for me, is that we had more than one server.
1:27 We had production 2, production 3.
1:30 And so I took this one out of rotation and the production system came back up.
1:36 So what lessons did I learn from that?
1:39 Well, the first, of course, is that mirrors can save you a lot of embarrassment.
1:48 And today we refer to that as a blue/green deployment where you first deploy to just a subset of your servers
1:56 and then roll out to more and more servers as you become more comfortable that the change you made is in fact legitimate.
2:03 And if there is a problem, you can roll it back very quickly and not affect a large swath of servers.
2:09 But probably the more important lesson here is having role separation.
2:17 Now, perhaps it was a question of convenience, maybe a little bit of laziness, but at that time,
2:23 we tended to share passwords to some of these administrator accounts.
2:28 And part of the reason we did that is that the administrator didn't really want to be bothered with requests like, you know, "apply this small change for me."
2:34 "Here, just give me that one password and I'll take care of it and then I'll be out of your way."
2:37 But it would be a lot smarter to use some other mechanism and, at a minimum, for example, to use groups.
2:46 So that way you don't have a root user, but you have a specific group of users
2:51 that can do specific functions within certain sections of your production environment.
2:57 But even better is having privileged access management.
3:02 And we have a video on that in case you want to learn more, but the idea is that you eliminate the need to have root access for a developer at all.
3:13 Instead, it provides a convenient way of being able to share some of these high-level functions with different users, but with traceability.
3:22 So with privileged access management, you can have a login that, instead of being a direct login,
3:27 goes through a separate system that manages the passwords among multiple users for a shared user ID.
3:35 Of course, that then eliminates root access.
3:43 I would also argue that you really should limit sudo access, because you might be tempted to use that as well,
3:52 which is a mechanism where you can act as the root user for just one command.
3:55 That's a little bit better.
3:57 But still, a developer should not really have that level of authority.
4:02 Really, the biggest lesson to come from this is you need to automate your deployment.
4:09 Now, in my case I was using a runbook, but in retrospect, it would have been a lot smarter to have a script that automatically deploys to production.
4:18 It has all the appropriate logins it needs to be able to do that.
4:22 It can test to make sure that the changes have been applied correctly, and roll them back if it sees that some of the systems are failing.
4:30 And then if you do that, you can hopefully avoid the embarrassment that I suffered.
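
Editor's note: the automated deployment the speaker describes (deploy to a subset first, test, and roll back on failure) might look roughly like the Python sketch below. It is only an illustration under stated assumptions, not the speaker's actual script: the function `staged_rollout` and the `deploy`, `is_healthy`, and `rollback` callables are hypothetical placeholders for whatever real tooling performs those steps, and the name "production1" is invented to sit alongside the "production 2" and "production 3" servers mentioned in the talk.

```python
from typing import Callable, Iterable, List


def staged_rollout(
    servers: Iterable[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> List[str]:
    """Deploy to one server at a time and undo everything on the first failure.

    Returns the servers left running the new version; the list is empty
    when the rollout was rolled back.
    """
    updated: List[str] = []
    for server in servers:
        deploy(server)
        updated.append(server)
        if not is_healthy(server):
            # Stop immediately and restore every server touched so far,
            # newest first. Servers not yet reached were never changed,
            # so they keep serving traffic throughout.
            for touched in reversed(updated):
                rollback(touched)
            return []
    return updated


if __name__ == "__main__":
    # In-memory stand-ins for real deploy, health-check, and rollback steps.
    versions = {"production1": "v1", "production2": "v1", "production3": "v1"}

    def fake_deploy(server: str) -> None:
        versions[server] = "v2"

    def fake_rollback(server: str) -> None:
        versions[server] = "v1"

    def fake_health_check(server: str) -> bool:
        # Simulate a failure on the second server.
        return server != "production2"

    result = staged_rollout(list(versions), fake_deploy, fake_health_check, fake_rollback)
    print("servers on new version:", result)
    print("final versions:", versions)
```

Because the script holds the credentials and decides where to deploy, nobody needs an interactive root shell on a production host, which removes the wrong-window mistake described at the start of the talk.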