Learning Library


Avoiding Uncontrolled Container Scaling Costs

Key Points

  • The main issue discussed is “scaling gone wild,” where improperly configured auto‑scaling policies cause excess worker nodes to remain active, leading to unexpectedly high costs.
  • Critical microservices (e.g., load balancers, monitoring, logging) are often deployed onto these nodes, preventing the cluster from scaling down because the services are marked as essential.
  • Proper configuration of auto‑scaling policies is the first step, ensuring the cluster can expand for peak events (like Black Friday) and contract when demand drops.
  • Comprehensive observability—capturing telemetry from the application layer through the underlying infrastructure—is needed to monitor resource usage and confirm that scaling actions are appropriate.
  • Automated alerting (via email, Slack, SMS, etc.) must be set up so that the right teams are notified instantly, enabling rapid response to scaling anomalies before they inflate costs.

Full Transcript

# Avoiding Uncontrolled Container Scaling Costs

**Source:** [https://www.youtube.com/watch?v=HDTqhqaF8L8](https://www.youtube.com/watch?v=HDTqhqaF8L8)
**Duration:** 00:08:52

## Sections

- [00:00:00](https://www.youtube.com/watch?v=HDTqhqaF8L8&t=0s) **Scaling Gone Wild: Stuck Nodes** - Misconfigured auto-scaling policies and the deployment of critical microservices onto worker nodes prevent those nodes from being terminated, resulting in unexpectedly high costs.
- [00:03:04](https://www.youtube.com/watch?v=HDTqhqaF8L8&t=184s) **Right Tool for the Role** - A hammer analogy illustrates that giving developers a tool unsuited to their expertise, such as a raw Kubernetes cluster, reduces productivity; tooling and responsibilities should be aligned with each role.
- [00:06:15](https://www.youtube.com/watch?v=HDTqhqaF8L8&t=375s) **Kubernetes Resource Limits & Scaling** - Perpetual crashes stem from missing CPU/memory limits on pods; proper resource planning requires a holistic view of all microservices and horizontal/vertical autoscaling that respects the capacity of the underlying physical infrastructure.

## Full Transcript
0:00 Welcome to Lessons Learned.
0:01 We have a special edition today which is on container problems.
0:05 We're going to explain what happened and how you could potentially avoid it.
0:09 With us today is an expert on it, Chris Rosen.
0:11 OK, Container Expert, present our first problem:
0:15 "Scaling gone wild." What exactly is the story behind this?
0:19 So envision that your application is successful.
0:22 That's a good thing.
0:23 All of us want our applications, our tools, to be successful and grow.
0:28 So when we scale up, that's accommodating the resource requirements needed to run these workloads.
0:34 That's a good thing.
0:35 So far so good.
0:36 We are adding resources, worker nodes, to that cluster.
0:40 However, at some point we start to incur a large bill
0:44 because those resources are no longer required
0:47 and we're not automatically scaling them back down.
0:51 I see, so we want these to go away at some point, but they don't.
0:54 And presumably there's a cause.
0:55 What's behind that?
0:57 So the cause generally is that
1:00 we've not configured the auto-scaling policy properly
1:03 and we're deploying critical, service-level microservices onto these worker nodes.
1:09 Maybe that's your application load balancer,
1:11 maybe it's monitoring and logging.
1:13 But when we do that, when we deploy these microservices to those worker nodes,
1:18 we can't automatically delete the nodes, because those are critical microservices needed to run the cluster.
1:25 I see, so you can't scale back down
1:28 because these are critical services and are marked as such.
1:32 Exactly.
1:32 Got it.
1:33 OK, so if using the correct configuration is step one, what else could they have done?
1:38 So clearly, like you said, step one is setting the right auto-scaling policy
1:43 so that we can meet the demand for a Black Friday event,
1:47 a weather event, or something else that's going to drive unexpected capacity.
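One common way to avoid the stuck-node problem described above is to keep cluster-critical services on a fixed node pool and mark workloads on burst nodes as evictable. A minimal sketch, assuming an illustrative node label `pool: system` and the standard Cluster Autoscaler annotation (the workload name and label are made-up examples, not from the video):

```yaml
# Hypothetical sketch: pin a cluster-critical service (e.g., a monitoring
# agent) to a fixed, non-autoscaled node pool so the autoscaler can freely
# reclaim burst nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: monitoring-agent            # illustrative workload name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: monitoring-agent
  template:
    metadata:
      labels:
        app: monitoring-agent
      annotations:
        # Read by the Kubernetes Cluster Autoscaler; without it, a pod like
        # this can block a node from being drained and scaled away.
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      nodeSelector:
        pool: system                # assumed label on the fixed node pool
```

With the critical services pinned this way, the burst pool only ever holds application pods, so the autoscaler can delete those nodes as soon as demand drops.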
1:54 But we also want to set it so that we keep those critical components off the autoscaled nodes and can scale back down.
2:00 So the configuration is very important.
2:02 Now, how do we monitor, how do we get the insights, the telemetry, into how those applications are performing?
2:10 And that's where observability comes into play.
2:12 Because in this new container world,
2:15 we want insights throughout the entire stack, from the infrastructure all the way up to our containers.
2:20 So we want to make sure they have the resources that are required for them, but not too much.
2:26 And that's how we monitor insight into the cluster and scale back down.
2:29 Well, what observability really buys you is that you get a single thread
2:34 of information all the way from the application transaction to the infrastructure it's running on.
2:39 But you're also going to need something else.
2:41 Exactly.
2:42 Because no one is sitting around watching the monitoring or logging dashboards for these events to take place.
2:49 So that's where alerting and custom alerts come in, whether it's email, a Slack integration, a text message.
2:56 We want to alert the right teams so that they can come in and take the right action immediately
3:02 and circumvent the problem that is building.
3:05 Excellent, so that's our first one. Let's go on to our second one.
3:09 OK, that was lesson one, now on to lesson two.
3:11 The problem is "I've got a hammer and ..."
3:15 I love this example because when we think about
3:18 one size fitting all, a hammer becomes the one tool to solve whatever you're trying to accomplish.
3:24 So as it relates to our container management,
3:27 it's that the developer persona is given a tool that is not purpose-fit for their job.
3:33 So if we give them the wrong solution, it's going to really drop their productivity.
3:39 Because instead of them looking for the right tool,
3:42 they're trying to force the wrong tool for that particular situation.
3:45 So they were given a Kubernetes cluster, for example.
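The custom alerting just described could be wired up, for example, as a Prometheus alerting rule that fires when the worker-node count stays above an expected ceiling. In this sketch, `kube_node_info` is a real metric exposed by kube-state-metrics, but the threshold, group name, and labels are illustrative assumptions:

```yaml
# Hypothetical Prometheus alerting rule: notify the team when the cluster
# has held more nodes than expected for 30 minutes, suggesting scale-down
# is blocked (the 20-node threshold is a made-up example).
groups:
  - name: cluster-scaling
    rules:
      - alert: WorkerNodeCountHigh
        expr: count(kube_node_info) > 20   # metric from kube-state-metrics
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Cluster has {{ $value }} worker nodes; scale-down may be blocked."
```

A rule like this would then be routed through Alertmanager to email, Slack, or SMS, matching the channels mentioned in the discussion.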
3:47 Why would that be the wrong tool?
3:49 Because the developer -- say, for example, I'm a front-end developer -- I don't
3:53 want to learn how to deploy and manage the lifecycle of my Kubernetes cluster.
4:00 I want some abstraction from it so that I can focus on what's important to me, which is writing code.
4:07 That's going to be my value-add to the business.
4:09 And that Kubernetes cluster can then be monitored by an administrator, for example.
4:14 Exactly.
4:14 So the administrator that has the right skills in Kubernetes
4:19 can create and manage that cluster, thinking about the line of responsibility.
4:24 They'll run the cluster and I can focus on application development.
4:28 And that kind of brings us to this first point, doesn't it?
4:31 Exactly.
4:32 It comes down to roles and responsibilities.
4:34 Being very prescriptive in the amount of access and control over what you can do within that cluster.
4:41 Ensuring that I'm doing the things to manage and run the cluster.
4:45 You deal with it at the application code level.
4:48 So creating those boundaries will really accelerate our utilization of the tool, which is a Kubernetes cluster in this case.
4:56 In fact, one of my pet peeves is that, as a developer,
4:59 I spend too much time having to learn new tools or new processes.
5:04 I spend 80% of my time there,
5:06 when really I want to spend 80% of my time on code and as little as possible on tooling.
5:13 Right, so we want to flip that.
5:15 We want our developers to spend 80% or more of their time writing code.
5:20 That's what they want to do.
5:21 They don't want to learn these new tools.
5:23 So when we think about the hammer analogy,
5:25 the Kubernetes cluster was not the right solution for that persona, the developer.
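The "prescriptive access and controls" idea above maps directly onto Kubernetes RBAC: the administrator keeps cluster-level rights, while a developer gets only what they need in their own namespace. A minimal sketch (the role name, namespace, and user are made-up examples):

```yaml
# Hypothetical RBAC sketch: a developer persona limited to managing
# Deployments in one namespace; cluster administration stays with the admin.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: app-developer             # illustrative role name
  namespace: team-frontend        # illustrative namespace
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "create", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-developer-binding
  namespace: team-frontend
subjects:
  - kind: User
    name: dev@example.com         # placeholder user identity
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: app-developer
  apiGroup: rbac.authorization.k8s.io
```

Boundaries like this let the developer push code through familiar CI/CD tooling without ever touching cluster lifecycle operations.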
5:30 Let's abstract it away and give them access to the tools that they're familiar with,
5:34 the CI/CD tools to integrate and push code.
5:38 And it all comes back to the right container management strategy:
5:42 creating the boundaries, giving the right users the right tools to be efficient at their jobs.
5:47 Excellent.
5:47 Hey, by the way, if you haven't seen Chris's video on container strategy, be sure to check it out.
5:52 It'll be right here.
5:54 OK, for our last lessons learned for containers we have "I've fallen" and something's gone wrong.
6:01 Exactly.
6:02 So the problem is that our pods, our containers, have fallen, or crashed.
6:09 Kubernetes is smart enough to redeploy them, but then it happens again and again and again.
6:16 So we need to really understand what is causing that continuous cycle.
6:21 So it's not just a failure of one specific resource, but a continual failure, essentially.
6:26 Okay, great.
6:27 So we understand the problem.
6:30 What could cause that?
6:31 Generally, in Kubernetes,
6:33 it's because we deploy those applications, those pods,
6:38 without setting the right resource limits.
6:40 So think about it: you've deployed your application, but you've not allocated enough CPU or memory.
6:45 So eventually you're going to consume all that you've been allocated, and all you can do is crash.
6:51 And I can also see that happening where, if you test in your development environment
6:55 and then go to production, the demands may well be different.
6:58 So you really need to plan for that, right?
7:01 Exactly. Because real life will drive additional resource requirements
7:07 that maybe you've not thought about in your development cycle.
7:10 So it does come down to planning.
7:12 And you can see here it's the entire stack.
7:14 It's what resources each of my microservices or components in that containerized application will require.
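The missing CPU and memory settings the speakers call out are declared per container in the pod spec: requests are what the scheduler reserves, limits are hard caps, and exceeding the memory limit gets the container OOM-killed, which is exactly the repeated-crash pattern described above. A minimal sketch (the names, image, and values are illustrative, not sized for any real workload):

```yaml
# Hypothetical pod spec with explicit resource requests and limits.
apiVersion: v1
kind: Pod
metadata:
  name: web-frontend                # illustrative name
spec:
  containers:
    - name: app
      image: registry.example.com/web-frontend:1.0   # placeholder image
      resources:
        requests:                   # reserved for scheduling decisions
          cpu: "250m"
          memory: "256Mi"
        limits:                     # hard caps enforced at runtime
          cpu: "500m"
          memory: "512Mi"
```

Sizing these values is where the planning the speakers describe comes in: observe real usage in production-like conditions, then adjust, rather than guessing once in development.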
7:21 So we think about it holistically as an application,
7:24 we think about individual containers,
7:26 and, extremely important, we think about the underlying infrastructure,
7:30 because we could set horizontal and vertical scaling policies within Kubernetes,
7:36 but eventually we'll run out of capacity within the physical infrastructure.
7:41 So then we need an auto-scaling policy to scale out and accommodate that growth in the workload.
7:46 And when you're trying to be accountable for that, it comes down to knowing it's going to happen.
7:52 Exactly.
7:52 It all comes down to having the insights, that in-depth telemetry: again, observability, metrics, logs.
8:00 How do we understand how each of these layers is performing,
8:04 where are the bottlenecks,
8:05 where do we allocate additional resources?
8:08 It's really a classic performance optimization pattern of being able to plan, observe, and finally adjust.
8:15 Adjusting is critical because we can do all of the planning and the forecasting,
8:20 but it really comes down to, once we deploy that workload, how do we observe it
8:24 and then come back to adjust.
8:25 With our next deployment,
8:27 using Kubernetes blue/green or red/black deployment strategies,
8:32 we can roll out and ensure that we've got the right capacity required.
8:36 Well, thanks, Chris.
8:37 That was excellent.
8:38 Before you leave, don't forget to leave us some comments
8:41 if there are problems that we haven't discussed, and maybe we'll cover those on the next Lessons Learned.
8:47 Thanks for watching!
8:48 Before you leave, hey, don't forget to hit like and subscribe.
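The horizontal scaling policy mentioned in the discussion can be expressed in Kubernetes as a HorizontalPodAutoscaler. A minimal sketch with illustrative names and bounds; note that, as the speakers point out, this scales pods only, and node capacity still depends on the cluster-level autoscaler:

```yaml
# Hypothetical HorizontalPodAutoscaler: scale an (illustrative) web-frontend
# Deployment between 2 and 10 replicas, targeting ~70% average CPU use.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-frontend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-frontend            # assumed Deployment to scale
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

The HPA works off the same metrics pipeline discussed under observability, so the plan, observe, adjust loop applies here too: watch how the replica count tracks real load, then tune the bounds and target on the next rollout.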