# Accidental Production Deletion: Lessons Learned

**Source:** [https://www.youtube.com/watch?v=fR-nfp1DiTs](https://www.youtube.com/watch?v=fR-nfp1DiTs)

**Duration:** 00:04:34

## Summary

- A careless “rm -f” run as root on the wrong terminal deleted the production server’s home directory, causing the system to go down.
- Deploying changes via a blue/green (or mirrored) strategy allowed the faulty server to be taken out of rotation and the service restored quickly using the untouched replicas.
- Sharing root passwords and giving developers unrestricted admin access creates dangerous shortcuts; instead, access should be scoped to specific groups or roles.
- Implementing privileged‑access management provides granular, auditable permissions, eliminating the need for direct root logins while still allowing necessary high‑level tasks.

## Sections

- [00:00:00](https://www.youtube.com/watch?v=fR-nfp1DiTs&t=0s) **Root Deletion Mistake on Production** - A developer inadvertently ran a dangerous “rm -f” command in the production environment after switching to root, wiping the home directory and causing an outage, underscoring the importance of clear environment separation, verification, and redundant mirrors.
- [00:03:13](https://www.youtube.com/watch?v=fR-nfp1DiTs&t=193s) **Secure Deployment Automation Practices** - The speaker advises using privileged access management and limiting sudo privileges while fully automating deployments with scripts that handle authentication, testing, and rollback to avoid costly mistakes and embarrassment.

## Full Transcript
Welcome to "Lessons Learned," a series where we share our biggest mistakes so you don't make the same ones.
Today's lesson comes from a time when I was a web developer deploying a fix to a production system.
So I was working on a fix on my laptop and I uploaded it to a test system.
And I verify that it works as expected.
So then, following a runbook, I upload it to the production system.
And again, I verify that it works as expected. So now I'm ready to move on to my next task and I want to clean up.
So I do what I think is a pretty reasonable command: "rm -f" to remove my test files that I've been using.
And inexplicably I get a warning that says "you don't have permission to do that".
Hmm.
Okay.
Well, this is where I made my big mistake.
I switch to the root user and go ahead and delete anyway with the same command.
Well, moments later, the production system goes down.
What I had done was click on the production window when I thought I was on the test system.
And I thought that I was in some distant directory, when in fact I was in the home directory of the production system.
So I deleted that and, of course, the system went down.
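
A pre-flight check could have caught each of the three mistakes here (wrong host, root user, wrong directory). The following is a minimal sketch, not what the speaker used; the function name `check_cleanup`, the host name, and the scratch directory are all hypothetical.

```python
import getpass
import socket
from pathlib import Path


def check_cleanup(target, allowed_root, expected_host,
                  current_host=None, current_user=None):
    """Return a list of reasons to refuse deleting `target` (empty means OK)."""
    # Host and user can be passed in for testing; otherwise ask the system.
    host = current_host if current_host is not None else socket.gethostname()
    user = current_user if current_user is not None else getpass.getuser()
    target = Path(target).resolve()
    allowed_root = Path(allowed_root).resolve()

    problems = []
    if host != expected_host:
        problems.append(f"on host {host!r}, expected {expected_host!r}")
    if user == "root":
        problems.append("running as root")
    # Only paths strictly inside the scratch directory may be removed.
    if target == allowed_root or allowed_root not in target.parents:
        problems.append(f"{target} is outside {allowed_root}")
    return problems
```

A cleanup script would call this first and delete only when the returned list is empty, printing the reasons otherwise.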
The good news, fortunately for me, is that we had more than one server.
We had production 2, production 3.
And so I took this one out of rotation and the production system came back up.
So what lessons did I learn from that?
Well, the first, of course, is that mirrors can save you a lot of embarrassment.
And today we refer to that as a blue/green deployment where you first deploy to just a subset of your servers
and then roll out to more and more servers as you become more comfortable that the change you made is in fact legitimate.
And if there is a problem, you can roll it back very quickly and not affect a large swath of servers.
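
The staged rollout described here (sometimes also called a canary or rolling deployment) comes down to splitting the server list into growing batches. A minimal sketch follows; `plan_waves` and the default fractions are illustrative assumptions, not something from the talk.

```python
import math


def plan_waves(servers, fractions=(0.25, 0.5, 1.0)):
    """Split `servers` into successive batches by cumulative fraction."""
    waves = []
    done = 0
    for fraction in fractions:
        # Cumulative target after this wave: at least one more server
        # than before, capped at the size of the fleet.
        upto = min(len(servers), max(done + 1, math.ceil(fraction * len(servers))))
        if upto > done:
            waves.append(servers[done:upto])
            done = upto
    return waves
```

With three servers this yields one server per wave, so a bad change reaches only the first server before anyone has to decide whether to continue.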
But probably the more important lesson here is having role separation.
Now, perhaps it was a question of convenience, maybe a little bit of laziness, but at that time,
we tended to share passwords to some of these administrator accounts.
And part of the reason we did that is that the administrator didn't really want to be bothered with requests like, you know, "apply this small change for me."
"Here, just give me that one password and I'll take care and then I'll be out of your way."
But it would be a lot smarter to use some other mechanism and at a minimum, for example, to use groups.
So that way you don't have a root user, but you have a specific group of users
that can perform specific functions within certain sections of your production environment.
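
As a rough illustration of group-scoped access, each group can be mapped to the actions it may take in a given area, with everything else denied. The group, user, and area names below are hypothetical, and a real system would rely on the operating system's or platform's own group mechanism rather than a lookup table like this.

```python
# Hypothetical groups: each maps to the (action, area) pairs it may perform.
GROUP_PERMISSIONS = {
    "web-deployers": {("deploy", "webapp"), ("restart", "webapp")},
    "log-readers": {("read", "logs")},
}

# Hypothetical membership: user -> groups.
USER_GROUPS = {
    "alice": {"web-deployers", "log-readers"},
    "bob": {"log-readers"},
}


def is_allowed(user, action, area):
    """True if any of the user's groups grants `action` on `area`."""
    return any(
        (action, area) in GROUP_PERMISSIONS.get(group, set())
        for group in USER_GROUPS.get(user, set())
    )
```

Nobody in this table can delete a home directory, because no group was ever granted that action.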
But even better is having privileged access management.
And we have a video on that in case you want to learn more, but the idea there is that you eliminate the need to have root access for a developer at all.
And instead [it] provides a convenient way of being able to share some of these high-level functions with different users, but having traceability.
So with privileged access management, you can have a login that instead of being a direct login,
is going through a separate system that manages the passwords among multiple users for a shared user ID.
Of course, that then eliminates root access.
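
The core idea, a broker that holds the shared password and records who used it, can be sketched in a few lines. This is a toy model under stated assumptions, not a real privileged access management product: `AccessBroker`, `run_as`, and the `runner` callback are hypothetical names.

```python
from datetime import datetime, timezone


class AccessBroker:
    """Toy stand-in for a privileged access management system."""

    def __init__(self, shared_secrets, grants):
        self._secrets = shared_secrets  # shared account -> password, never shown to users
        self._grants = grants           # user -> set of shared accounts they may use
        self.audit_log = []             # one entry per request, granted or not

    def run_as(self, user, account, command, runner):
        granted = account in self._grants.get(user, set())
        # Record the request before acting, so denials are traceable too.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "account": account,
            "command": command,
            "granted": granted,
        })
        if not granted:
            raise PermissionError(f"{user} may not act as {account}")
        # The broker supplies the secret; the caller never sees it.
        return runner(account, self._secrets[account], command)
```

The audit log ties every use of the shared account back to an individual user, which is the traceability that shared passwords lack.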
I would also argue that you really should limit sudo access, because you might be tempted to use that as well.
Sudo lets you act as the root user for just one command.
That's a little bit better.
But still, a developer should not really have that level of authority.
Really, the biggest lesson to come from this is you need to automate your deployment.
Now, in my case I was using a runbook, but in retrospect, it would have been a lot smarter to have a script that automatically deploys to production.
It has all the appropriate logins it needs to be able to do that.
It can test to make sure that the changes have applied correctly--and roll them back if it sees that some of the systems are failing.
And then if you do that, you can hopefully avoid the embarrassment that I suffered.
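
A deployment script of the kind described might follow a deploy, verify, roll back loop. The sketch below assumes the caller supplies three callbacks (`deploy`, `verify`, `rollback`); those names and the function `deploy_release` are illustrative, and the actual logins and health checks depend on your environment.

```python
def deploy_release(hosts, deploy, verify, rollback):
    """Deploy host by host; on the first failed check, undo every host touched.

    Returns (succeeded, hosts_left_on_new_release).
    """
    touched = []
    for host in hosts:
        deploy(host)
        touched.append(host)
        if not verify(host):
            # Roll back in reverse order, including the host that failed.
            for done in reversed(touched):
                rollback(done)
            return False, []
    return True, touched
```

Because the script only ever runs the three supplied operations, there is no step where a person types a cleanup command into the wrong terminal.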