Site Reliability Engineering: Role and Automation

Key Points

  • Site Reliability Engineering (SRE) is a formally named discipline that blends traditional IT operations with modern DevOps practices, providing reliable service delivery beyond the developers’ responsibilities.
  • An SRE’s work is roughly split 50/50: half the time is spent responding to incidents, escalations, and customer problems, and the other half focuses on eliminating manual “toil” through automation.
  • Automating routine operational tasks does not jeopardize the SRE role; each automation effort delivers new system insights and creates opportunities for further automation, continuously improving reliability.
  • The SRE mindset treats operations like software development—writing code to programmatically resolve recurring issues—so that human intervention is minimized and systems become more self‑service.

Full Transcript

**Source:** [https://www.youtube.com/watch?v=ztIIcXNzMN4](https://www.youtube.com/watch?v=ztIIcXNzMN4)

**Duration:** 00:08:11

## Sections

- [00:00:00](https://www.youtube.com/watch?v=ztIIcXNzMN4&t=0s) **Introducing Site Reliability Engineering** - IBM Cloud product manager Bradley Knapp defines SRE as the collision of traditional IT operations and DevOps, and as a 50/50 role split between solving customer issues and automating away manual work. He explains how it picks up the job of operating and delivering the service, which falls outside the developers' responsibilities.
- [00:03:14](https://www.youtube.com/watch?v=ztIIcXNzMN4&t=194s) **SRE: Customer‑Facing Knowledge Hub** - The speaker explains that Site Reliability Engineers interact directly with customers, serve as the cross‑functional knowledge base for hardware, software, monitoring, logging, and automation, and continuously feed operational insights back to development while proactively identifying and automating solutions to inevitable failures.
- [00:06:21](https://www.youtube.com/watch?v=ztIIcXNzMN4&t=381s) **SRE Mindset for Small Teams** - In small firms, developers act as operators and apply SRE principles (anticipating failure, building redundancy, automating fixes, and monitoring) to achieve resilience without a dedicated SRE department.

## Full Transcript

0:01 Thank you for joining us today! My name is Bradley Knapp, and I'm one of the product managers here at IBM Cloud, and we've come to answer the question: what is Site Reliability Engineering, or SRE?

0:12 And SRE is really the name for a new discipline that's actually an old discipline. It's a new name, it's only been around 15, 18 years, but the job itself has been around for a very long time. It's just evolved over time, and now we've given a formal name to the discipline and the job.

0:29 And so, the question is, what is SRE, what is site reliability engineering? And so, the way that I like to describe it is that it's really the collision of the traditional IT role and DevOps, right? So, back in the day, in the traditional IT role, you would think about lots of people sitting in an operations center staring at very large screens, kind of arranged in a semicircle, like a mission center, or a watch center in the military.

0:57 Well, that world doesn't so much exist anymore, and in the new world, in the DevOps cycle that everyone should be embracing for their software releases, you still have to have reliability. Your developers are still going to engineer the software to be reliable, but when it comes to actually operating it, actually delivering the service that goes out to the end customer, that's really kind of outside of the responsibility of those software developers. That's where SRE comes in.

1:26 An SRE is what I like to call a 50/50 role, right? SREs should spend about 50% of their time focusing on solving customer issues. That can be escalations, could be responding to incidents, dealing with an upset customer who needs help on a tactical problem. That's going to be 50, and then the other 50% is maybe the most important part, and that's every SRE should be actively trying to automate themselves out of a job.
They want to automate all of the things. The buzzword for this is reducing toil, right? Reducing all of the manual work necessary to keep any kind of software environment up and running. This includes the hardware itself, it includes all of the middleware, it includes the software - all of the related services you have to keep these things live.

2:17 And so, the question then becomes: all right, well, we're going to automate these things, isn't that putting my job at risk if we get rid of these manual tasks? And the answer is: in reality, no it's not. It's never going to put your job at risk, because every time you automate something, you gain some additional insight into the system. Every time you automate something, you learn something new, and you identify additional tasks that you'll be able to automate in the future.

2:44 And so, automation is core. It's approaching operations with a development mindset, because you want to programmatically solve problems so that you don't have to go in and make the same manual fix time after time after time. This is key to the SRE role, and it's key to your success in it.

3:03 And so that other 50% of the time, I talked about that before, right? That's going to be escalations. It's going to be on-call work or, in some cases, for a large enough organization, SRE might be 24-7. It's going to include customer-facing work, right? You are going to have to interact with customers, and it's going to include being the source of knowledge for your group.

3:28 Because SRE crosses all boundaries: it knows about hardware, it knows about software, it knows about monitoring, it knows about logging, it knows about automation. And so, they understand all of the different components. They have the institutional knowledge of how to keep the product up and running. As a product manager, I like to make the joke that when I want to know how software's designed to run, I go and I ask the developers who wrote it. When I want to know how it actually runs, I go and I ask SRE, because they're the ones who get to deal with the implementation every day.

4:01 And so, bridging the gap between what actually happens and what we want to happen, that's so important to the SRE job because they have day-to-day hands-on interaction with how people actually use the product. So, SRE is constantly feeding data back into development so that development can make the software better, at the same time that they're automating in all of the resiliency.

4:26 SRE understands that failure will happen. Failure is just the nature of business. You cannot design a perfect system. And so, what SRE excels at is programmatically identifying potential failures and solving them ahead of time, and it's also good at identifying how we are going to solve immediate tactical problems.

4:49 And so, I talked a minute ago about monitoring, right? That traditional IT room with all of the screens. Well, monitoring and logging are just key to the SRE role. So SREs, as they monitor, they're keeping track of what's happening in real time. Logging is an archive of everything that's happened, so that you can go back and examine it later.

5:13 So, your monitoring is going to give you the ability to anticipate failures and see them coming so that you can proactively solve them. Logging is for when you get an unanticipated failure. It allows you to go back and see what happened. You can do an RCA, a Root Cause Analysis, on it and figure out how to solve it, not just for now, but for the future. That gets back into the automation again, right?

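
Editor's illustration: the distinction drawn above between monitoring (watching the current state so trouble can be anticipated) and logging (an archive to examine later in a root cause analysis) can be sketched in a few lines of Python. This is a minimal sketch, not something shown in the video. The names `ArchiveHandler`, `make_logger`, `check_disk`, and `events_for_rca`, the disk-usage metric, and the 0.80 threshold are all assumptions chosen for illustration; only the standard-library `logging` module is real.

```python
import logging


class ArchiveHandler(logging.Handler):
    """Keeps log messages in memory; a hypothetical stand-in for a real log store."""

    def __init__(self) -> None:
        super().__init__()
        self.lines: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        # With no formatter set, format() returns just the rendered message.
        self.lines.append(self.format(record))


def make_logger(archive: ArchiveHandler) -> logging.Logger:
    """Build a logger that writes only to the given archive."""
    log = logging.getLogger("sre_sketch")
    log.setLevel(logging.INFO)
    log.handlers.clear()
    log.addHandler(archive)
    log.propagate = False
    return log


def check_disk(used_fraction: float, log: logging.Logger,
               warn_at: float = 0.80) -> bool:
    """Monitoring: inspect the current reading and flag trouble before it
    becomes a failure. Returns True when someone (or something) should act."""
    # Logging: every reading is archived, whether or not it looks alarming.
    log.info("disk used_fraction=%.2f", used_fraction)
    if used_fraction >= warn_at:
        log.warning("disk above %.0f%% threshold", warn_at * 100)
        return True
    return False


def events_for_rca(archive: ArchiveHandler, last_n: int = 5) -> list[str]:
    """After an unanticipated failure, pull the most recent entries to review."""
    return archive.lines[-last_n:]
```

In this sketch the monitor decides in the moment, while the archive only pays off afterwards, when `events_for_rca` is used to reconstruct what led up to a failure.
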
5:38 If you know what happened, and you know why it happened, you can then adjust that monitoring that we were talking about, so that the monitoring itself will catch this edge case and you don't encounter that failure ever again.

5:52 So, SRE is just core to a successful business, and most companies will find they have a role pretty similar to SRE today. In the world of software, in the world of technology, it's something that we already have, even though we may not be calling it SRE. But if you're talking to a startup, a very young company, they're going to say, well, you know, we don't have the budget to go out and develop an SRE organization to start with, right? We only have 25 employees, we only have 30 employees, and that's okay.

6:22 The important part of SRE for a small company is not so much having someone with that job title, because your developers are your operators at that point. It's engineering everything they do with that SRE mindset: that failure is an option and, as a matter of fact, should be predicted for, but is something that you can automate to solve. It's something that you can create enough redundancy for that, when failure does happen, it's not a big deal because you're resilient enough that nothing goes down.

6:52 And so, as long as you develop with that SRE mindset in mind, and you are being resilient, you're being redundant, you are constantly going back and automating problems so that you don't have to manually fix the same thing over and over and over again, and you're doing good root cause analysis on actual failures so that they don't happen again, and you're monitoring so that you will know when they're about to happen and you can head it off at the pass - that's really the key.

7:19 Large organizations, they can afford an entire SRE department.
They can stand it up, or they can transition an existing operations group into it by empowering that operations group. Again, that 50/50 rule: spending half their time automating, half their time fixing problems, and automating all of the things. Automate everything, because the less manual work and manual intervention you have, the happier that SRE team is going to be.

7:47 Thank you so much for your time today. If you have any questions, please drop us a line below. If you want to see more videos like this in the future, please do like and subscribe and let us know. And don't forget: you can grow your skills and earn a badge with IBM Cloud Labs, which are free, browser-based interactive Kubernetes labs, that you can find more information on by looking below. Thanks again!
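
Editor's illustration: two of the habits the transcript describes, building in redundancy so a single failure "is not a big deal" and automating a fix so nobody applies it by hand "over and over again", might look roughly like the Python sketch below. It is a sketch under stated assumptions, not the speaker's method. `call_with_failover` and `auto_remediate` are hypothetical names, and the replicas, health check, and fix are plain callables supplied by the caller.

```python
from typing import Callable, Sequence


def call_with_failover(replicas: Sequence[Callable[[], str]]) -> str:
    """Redundancy: try each replica in order and return the first success,
    so one failing replica does not take the whole service down."""
    errors: list[Exception] = []
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:  # deliberately broad in this sketch
            errors.append(exc)
    raise RuntimeError(f"all {len(replicas)} replicas failed: {errors}")


def auto_remediate(is_healthy: Callable[[], bool],
                   fix: Callable[[], None],
                   max_attempts: int = 3) -> bool:
    """Automation: codify a fix that used to be applied by hand.
    Check health, apply the fix if needed, and check again.
    Returns False when the fix did not help and a human should look."""
    for _ in range(max_attempts):
        if is_healthy():
            return True
        fix()
    return is_healthy()
```

A caller would pass its own functions, for example `call_with_failover([read_primary, read_secondary])`, where both names are placeholders for whatever redundant backends exist. When `auto_remediate` returns `False`, that is the cue for an escalation and, afterwards, a root cause analysis.
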