Learning Library

Automating Server Deployment with Orchestrators

Key Points

  • Deploying the same application manually on multiple servers requires individual logins, installations, and troubleshooting, making the process error‑prone and inefficient.
  • A workload orchestrator automates the entire lifecycle—describing required resources, handling deployment, scaling, and resiliency—eliminating the need for human intervention.
  • When a server or job fails, the orchestrator automatically detects the issue, restores the workload to its prior state, and treats the event as routine rather than a crisis.
  • While Kubernetes is a popular orchestration platform, it often involves numerous interdependent components (e.g., ConfigMaps, Secrets, storage) and complex YAML configurations, which can add overhead compared to simpler workload orchestration solutions.
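The single-file contrast in the last bullet can be sketched concretely. The video never names the orchestrator, but its mention of HCL job files suggests a HashiCorp Nomad-style tool, so the syntax below is an assumption, and the job name and image are hypothetical:

```hcl
# Hypothetical single-file job spec (Nomad-style HCL; the video does not
# name the orchestrator, so this syntax is an assumption).
job "web" {
  datacenters = ["dc1"]
  type        = "service"

  group "app" {
    count = 3                       # replicas, declared in one line

    task "server" {
      driver = "docker"
      config {
        image = "nginx:1.27"
      }

      # Inline template in place of a separate ConfigMap object; secrets
      # could be injected similarly instead of via a separate Secret object.
      template {
        data        = "listen_port = 8080"
        destination = "local/app.conf"
      }
    }
  }
}
```

The equivalent Kubernetes setup described in the key points would split this across a Deployment, a ConfigMap, a Secret, and a storage definition.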

Full Transcript

**Source:** [https://www.youtube.com/watch?v=YsEnqWnZcME](https://www.youtube.com/watch?v=YsEnqWnZcME)
**Duration:** 00:11:33

## Sections

- [00:00:00](https://www.youtube.com/watch?v=YsEnqWnZcME&t=0s) **Automating Multi-Server Application Deployment** - The speaker contrasts the tedious manual process of logging into each VM to install and troubleshoot an app with using a workload orchestrator that automates deployment, scaling, and resiliency across multiple servers.
- [00:03:07](https://www.youtube.com/watch?v=YsEnqWnZcME&t=187s) **Kubernetes Complexity vs Simple Orchestrator** - The speaker contrasts Kubernetes's multi-YAML, component-heavy deployments with a lightweight workload orchestrator that uses a single HCL job, claiming it has a gentler learning curve and enables faster application rollouts.
- [00:08:00](https://www.youtube.com/watch?v=YsEnqWnZcME&t=480s) **Flexible Orchestration Beyond Kubernetes** - The speaker explains why Kubernetes isn't ideal for ephemeral batch workloads and proposes a dedicated workload orchestrator that can flexibly manage web apps, training, batch jobs, and inference with resource-specific job specs.

## Full Transcript
[0:00] Imagine you need to run an application on a fleet of servers. Let me show you something. We've got a VM here, VM1, and another one, VM2, and then VM3. Looking at this, I need to deploy an application on each server. What are you going to do? You're going to simply log in to each and every one, and then you deploy your application. Even though it's the same application, you still have to go through the process every time. So after you've logged in and installed the application, let's say you run into a problem. You're going to have to log in again and troubleshoot, and good luck with that. Normally, this is a very manual process.

[0:50] With workload orchestrators, if something goes wrong, you don't have to do any of this, because the orchestrator automates all of it and eliminates the human intervention. You describe what you want the application to do: these are the required resources, these are the runtime requirements. And the orchestrator places it on its own. It handles deployment, it handles scaling, it handles resiliency, all automatically.

[1:25] Ray, tell me, what is workload orchestration? Great question. A workload orchestrator is a system that allows organizations to run multiple apps, like web apps, for example, and also AI and ML workflows and batch jobs. You can run all of this across multiple servers and environments. It also automates the scheduling, the placement, and the health monitoring. That being said, a workload orchestrator just takes the whole manual process I was talking about earlier and automates it for you, so you don't have to worry about it if anything fails.
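The "describe the required resources and runtime requirements" step above might look like the following declarative spec. This is a sketch in Nomad-style HCL, an assumption since the speakers never name their orchestrator, and the job name and image are made up:

```hcl
# Sketch only: Nomad-style HCL assumed; names and values are illustrative.
job "api" {
  datacenters = ["dc1"]

  group "api" {
    task "api" {
      driver = "docker"             # the runtime requirement
      config {
        image = "example/api:1.0"   # hypothetical image
      }

      resources {                   # the required resources
        cpu    = 500                # MHz
        memory = 256                # MB
      }
    }
  }
}
```

Given a file like this, the orchestrator chooses placement, deploys, scales, and restarts the task on failure, with no manual logins involved.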
[2:17] Can you give me an example? Yeah. For example, let's say we have a server here, and this server failed. The workload orchestrator, I'll write it as WO, will automatically check the job and bring it back to its prior state. Did David intervene here? No. Did Ray touch it? No. We didn't have to log in, and we didn't have to do anything. So this whole manual process was eliminated, and the workload orchestrator, by itself, was able to treat the failure not as a crisis, the way we used to, but as business as usual.

[2:54] We both use Kubernetes, right? Yeah. Many companies use Kubernetes. We love Kubernetes. So why would we use this approach? Great question. Kubernetes is an amazing tool, there's no doubt about it. But let's think about it. If you have a Deployment, does the Deployment come by itself? No. It comes with a lot of components: we can add a ConfigMap, we can add a Secret, we can add storage. How many YAMLs are we talking about here? We're talking about four separate YAMLs.

[3:31] On the other side, with a workload orchestrator, the DevOps engineer just does something simple: he pushes one single HCL job. Where does it go? It goes to your workload orchestrator, which reads that job, finds out how many replicas you want, and then deploys your application. Assuming you asked for three replicas, you get three instances of the application, while your deployment strategy is also respected. And all of this just runs on a single server: you've got an operating system at the bottom, the orchestrator living on top, and all of it inside your VM.

[4:24] So I see YAMLs. We know YAMLs; we've written YAMLs for Kubernetes. What is a job, and is there a hard learning curve to learn it?
I don't think there's a hard learning curve. You have a single file, and it's very lightweight. In the best-case scenario, a DevOps engineer deploying a traditional application will take days, not months. As he goes, he picks up another feature and starts to learn about it, and then another feature, and learns about that. So it's not about the application itself; it's about your own pace. You can go at your own speed: if you want to learn the whole thing quickly, sure, you can do that; if you want to take it easy, go at your own pace. That's what I recommend.

[5:16] Yeah. A workload orchestrator sounds useful for traditional application deployments. What about AI and ML? Can they blend into the whole picture here?

[5:30] So what is the traditional approach for AI? Some companies will spin up one Kubernetes cluster for their web apps, another one for their training, and a third for their batch jobs. Other companies will just put it all on one cluster, with everything that entails: complex namespace configurations, resource quotas. Whichever way you go, you end up with a headache. If you have multiple clusters, you have multiple ops teams and multiple monitoring teams. If you have one cluster that's supposed to serve everybody, it's going to be difficult to deal with. And on top of all of this, AI workloads keep evolving. Eight years ago, we didn't have transformer models. Three years ago, we didn't have GPT-level inference yet.

[6:26] Now, if you've worked with AI or ML workloads before, you may have noticed a pattern. Your web teams are deploying microservices on Kubernetes. Your data scientists might be using Slurm, because that's what they used in grad school, and that's how they work with GPU jobs.
Your data engineers work with Airflow for their pipeline management. And then the ML team might be deploying services in containers, or maybe they're just SSHing directly into a box and running custom scripts. That's four teams with four tool sets and four totally different sets of expertise. If something breaks, well, good luck figuring out which system is the problem.

[7:16] Okay, this is awesome. Walk me through a real-world example. Okay. Say a data scientist wants to run a training job. First, they file a ticket with their DevOps team. They wait for the cluster to be ready. They wait for approval. And a couple of days later, they're able to train their model. Exactly. At the same exact time, the web team is deploying continuously on Kubernetes. This is the same company, the same organization, and yet totally different universes of efficiency.

[7:48] AI workloads are fundamentally different. A training job could run for three hours and never run again. Your inference service has to be up 24x7, connected to your GPUs. And your pipeline runs on a schedule. This calls for flexible orchestration, and especially considering that we don't know what's coming next, but we do know something is coming, it has to stay flexible.

[8:10] Awesome. So, hold on. Flexible. Sounds interesting. Tell me more about it. First, a caveat: why can't we do this on Kubernetes? We can do this on Kubernetes. We can run batch jobs, we have CronJobs, and all this stuff works. But if you've ever tried to run ephemeral workloads, like batch jobs, on Kubernetes, it's a bit awkward. That's because Kubernetes wasn't designed for this. It was designed for long-running containerized services that it keeps alive. Now, workload orchestrators are a bit different.
Whether you have a batch job, a system job, or a service that you need running, they're all first-class citizens in the scheduler.

[8:56] So, tell me more about this in the drawing, maybe. Sure. When you think about flexible orchestration, let's get back to that. You're going to have one cluster, and this is going to be your workload orchestrator. It will be able to spin up your web app. It will be able to spin up your training and your batch jobs, which run once and then never run again. It will run your inference. And each of these services is tied to a resource, and each of these resources can be defined within the job spec. That makes it particularly easy: you could have a job spec that says, "I need this spread out across one data center," or "I need this on a combination of data centers," or "I need this on a specific rack in a specific data center."

[9:45] Awesome. So, correct me if I'm wrong, what I'm looking at here is a before-and-after picture. On the "before" side, I'll draw a square saying we've got, as you said, fragmented ops. We've also got many tools, and there's specialized knowledge. Yeah. When you go to flexible orchestration, instead of all of that, we have unified ops, we have one tool, and we have shared knowledge, which makes this so much easier and so much more scalable.

[10:49] Okay, so what does that really mean for teams? Look back at the data scientist. This time, they can schedule their own training themselves. It takes a couple of minutes, not a couple of days. They're in the same exact workflow as the web app team. And your DevOps team has one platform that they need to know really well.
[11:09] So if something breaks at 2 AM, they know where to look. They have one set of logs to check, and it just works. This is the power of flexible orchestration. And when the next AI breakthrough comes, because we know it's coming, you don't rebuild your infrastructure; you write a new job spec. That's operational simplicity without sacrificing capability.
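The placement options described in the drawing, spreading across data centers or pinning to a specific rack, map naturally onto job-spec stanzas. Again a Nomad-style HCL sketch, offered as an assumption since the tool is never named; the `meta.rack` attribute presumes operators have tagged their nodes with rack IDs, and the image is made up:

```hcl
# Batch work and rack-aware placement in one spec (Nomad-style HCL assumed;
# the rack attribute and image are illustrative).
job "train" {
  datacenters = ["dc1", "dc2"]      # a combination of data centers
  type        = "batch"             # runs once, then never again

  # Spread allocations evenly across the listed data centers...
  spread {
    attribute = "${node.datacenter}"
  }

  # ...or pin the job to one rack via a node attribute.
  constraint {
    attribute = "${meta.rack}"      # assumes nodes are tagged with a rack ID
    value     = "r42"
  }

  group "trainer" {
    task "run" {
      driver = "docker"
      config {
        image = "example/trainer:1.0"
      }
    }
  }
}
```

Because `batch` is a first-class job type here, the one-shot training run needs no workaround layered on top of a long-running-service model.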