Domain‑Specific LLM Training with InstructLab
Key Points
- The traditional LLM pipeline relies on data engineers and data scientists to curate structured database inputs, which makes it hard to incorporate domain‑specific knowledge stored in unstructured formats.
- Tools like InstructLab let project managers and business analysts feed domain knowledge from documents (Word, PDFs, text files) into a git‑based taxonomy, eliminating the need for a dedicated data‑scientist step.
- InstructLab automatically generates synthetic variations of questions from the curated content, enriching training data and improving the model’s ability to understand diverse prompts.
- After training, the model can be deployed on Kubernetes‑based platforms such as Red Hat OpenShift, leveraging GPU/accelerator hardware from NVIDIA, AMD, or Intel.
- The OpenShift AI extension provides the MLOps stack for model serving, inference configuration, and metrics collection, completing the end‑to‑end lifecycle.
Sections
- [00:00:00](https://www.youtube.com/watch?v=0OOXGwLENyY&t=0s) Integrating Domain Knowledge with InstructLab - The speaker explains how to augment the traditional LLM data pipeline by involving project managers and analysts and using InstructLab’s git‑based taxonomy to incorporate non‑database documents into model training.
- [00:03:12](https://www.youtube.com/watch?v=0OOXGwLENyY&t=192s) Synthetic Data‑Driven LLM Deployment - The speaker explains how InstructLab automates synthetic data generation and model training, then deploys the LLM on Kubernetes/OpenShift using AI accelerators and Red Hat OpenShift AI to manage inference, metrics, and the full MLOps lifecycle with governance.
- [00:06:25](https://www.youtube.com/watch?v=0OOXGwLENyY&t=385s) Cost‑Effective RAG Model Pipeline - The segment outlines a budget‑aware workflow that runs data processing intermittently, leverages RAG and a taxonomy‑driven pipeline—including project manager and analyst input, synthetic data generation via InstructLab, and training with watsonx.ai—before deploying the model on an OpenShift/Kubernetes platform.
Source: [https://www.youtube.com/watch?v=0OOXGwLENyY](https://www.youtube.com/watch?v=0OOXGwLENyY)
Duration: 00:07:50
Full Transcript
Hey, everybody.
Today, I want to talk to you about how to apply domain specific knowledge to your LLM lifecycle.
A traditional approach starts with a data engineer,
who curates data that is then used by data scientists,
who ultimately take that data,
train the model on it,
and then make that model available for inference.
One of the challenges with this, though,
is that the data being used over here
typically lives in a traditional database of some sort, either SQL or NoSQL,
and it usually contains metrics,
sales data, or anything else that's typically organized or curated by an organization.
The challenge here is getting domain specific knowledge within
an organization and applying it to this same process.
Now let's look at the same approach,
but use a variety of tools that empower people like project managers
and business analysts to contribute to the process.
So here we have a project manager
and a business analyst.
They both have domain specific knowledge about processes
within their organization,
but these could be stored in Word documents or text files of some sort,
not the traditional data stores that we typically use within a model lifecycle,
but we can change this.
We can use a tool like InstructLab to manage this process.
InstructLab is an open source tool that allows the management of what we call a taxonomy.
This taxonomy is typically just a git repository where we can manage things like YAML files or text files
and then apply that to our model.
We could even have more traditional document formats like PDFs,
and have those transformed into the necessary file structure that InstructLab expects.
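To make the taxonomy step more concrete, here is a rough sketch of adding a knowledge entry as a `qna.yaml` file inside the taxonomy git repository. The directory path, domain, and all field values are hypothetical, and the YAML schema shown is a simplified approximation of InstructLab's knowledge format rather than a verbatim copy of the current spec:

```python
from pathlib import Path

# Hypothetical taxonomy checkout and knowledge domain path.
taxonomy_root = Path("taxonomy")
entry_dir = taxonomy_root / "knowledge" / "company" / "expense_policy"

# Simplified sketch of an InstructLab-style qna.yaml; consult the
# InstructLab docs for the authoritative schema.
QNA_YAML = """\
version: 3
created_by: example-business-analyst
seed_examples:
  - context: |
      Employees may claim travel expenses within 30 days of travel.
    questions_and_answers:
      - question: How long do employees have to claim travel expenses?
        answer: Travel expenses must be claimed within 30 days of travel.
document:
  repo: https://example.com/docs.git
  commit: abc123
  patterns:
    - expense-policy.md
"""

def write_entry() -> Path:
    """Write the qna.yaml into the taxonomy tree and return its path."""
    entry_dir.mkdir(parents=True, exist_ok=True)
    path = entry_dir / "qna.yaml"
    path.write_text(QNA_YAML, encoding="utf-8")
    return path

if __name__ == "__main__":
    print(write_entry())
```

In a real workflow this file would be committed to the taxonomy repository so InstructLab can pick it up on its next run.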
Once they've applied that data to the taxonomy that InstructLab manages,
we can then start the more traditional process that we saw earlier,
but we don't actually need a data scientist in this case.
InstructLab is handling all that,
and it will then create synthetic data
through this process.
Now, I know "synthetic data"
sounds kind of scary, but I want to approach it in a different way.
Think of synthetic data in this case as just another way of reframing the question:
instead of one way of asking a question,
we have many different ways of asking the same question.
This empowers the model, especially when we go through the training cycle,
giving the LLM more opportunities to accurately reply to your prompts.
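InstructLab itself uses a teacher model for synthetic data generation, but the idea of "many ways of asking the same question" can be illustrated with a toy, template-based paraphraser. The templates and seed question below are made up for illustration only:

```python
# Toy stand-in for synthetic data generation: reframe one seed question
# into several phrasings. Real systems (like InstructLab) use a teacher
# LLM for this; these templates are hypothetical.
TEMPLATES = [
    "{q}",
    "Can you tell me {q_lower}",
    "I was wondering: {q}",
    "Quick question: {q}",
]

def reframe(question: str) -> list[str]:
    """Return several phrasings of one seed question."""
    q_lower = question[0].lower() + question[1:]
    return [t.format(q=question, q_lower=q_lower) for t in TEMPLATES]

if __name__ == "__main__":
    for variant in reframe("How long do employees have to claim travel expenses?"):
        print(variant)
```

Each variant pairs with the same answer during training, which is what enriches the dataset.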
Once we've trained the model, we can then go ahead and deploy it into an AI platform.
This could be Kubernetes-based, like OpenShift, for example,
and it can take advantage of different AI accelerators
from vendors like NVIDIA,
AMD, or Intel, for example.
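To sketch what a deployment on such a platform might look like, here is a KServe-style `InferenceService` manifest built as a Python dict, requesting one NVIDIA GPU via the `nvidia.com/gpu` extended resource. The model name, storage URI, and model format are hypothetical placeholders, and this is one common serving convention rather than the specific setup described in the video:

```python
import json

def inference_service(name: str, storage_uri: str, gpus: int = 1) -> dict:
    """Build a hedged sketch of a KServe-style InferenceService manifest."""
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": {
            "predictor": {
                "model": {
                    # Hypothetical model format and storage location.
                    "modelFormat": {"name": "vLLM"},
                    "storageUri": storage_uri,
                    "resources": {
                        # Request GPU accelerators from the node.
                        "limits": {"nvidia.com/gpu": str(gpus)},
                    },
                },
            },
        },
    }

if __name__ == "__main__":
    manifest = inference_service("domain-llm", "s3://models/domain-llm")
    print(json.dumps(manifest, indent=2))
```

The same shape would apply with AMD or Intel accelerators, using their respective extended resource names.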
Now, this is the infrastructure layer,
but we want to be able to interact with our model, configure
the inference, apply metrics,
and do all those things that need to be part of the MLOps lifecycle.
Now we can do this
with an extension for OpenShift called Red Hat OpenShift AI,
which will provide you all those tools for managing the lifecycle of this model within production.
Now, you may want to then interact with that model:
validate it, apply governance,
or even just sandbox with it.
This could be done with something like watsonx.ai,
which can sit on top of OpenShift and interact with all the models being served within this AI stack.
Now, once this lifecycle has finished,
we can then restart the whole process again,
use the new data that has since been built up by our project managers and business analysts,
and go through this lifecycle once more.
But one thing to note is that this can be really costly.
We may not want to run this process over and over again every week;
we may only have the budget to run it once a month or once every other month.
Well, we can use technologies like RAG,
and have this data live over here in the interim before we go through this process again.
Once we do that, we can flush out our RAG database and then start anew
as data is collected by our project managers and our business analysts.
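The "collect in the interim, then flush after retraining" pattern can be sketched as a minimal in-memory RAG store. Real deployments would use a vector database and an embedding model; plain word-overlap scoring and all the names below are invented just to keep the sketch self-contained:

```python
class InterimRagStore:
    """Toy stand-in for the interim RAG database described above."""

    def __init__(self) -> None:
        self.docs: list[str] = []

    def add(self, doc: str) -> None:
        """Collect new domain knowledge between training runs."""
        self.docs.append(doc)

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        """Return the k docs sharing the most words with the query
        (a crude substitute for embedding similarity)."""
        q = set(query.lower().split())
        scored = sorted(
            self.docs,
            key=lambda d: len(q & set(d.lower().split())),
            reverse=True,
        )
        return scored[:k]

    def flush(self) -> None:
        """Empty the store once the knowledge is baked into a retrained model."""
        self.docs.clear()

if __name__ == "__main__":
    store = InterimRagStore()
    store.add("Travel expenses must be claimed within 30 days.")
    store.add("The office closes at 6 pm on Fridays.")
    print(store.retrieve("claim travel expenses", k=1))
    store.flush()  # after the next training run, start anew
    print(store.docs)  # → []
```

Between monthly training runs, new documents feed retrieval; after retraining, the store is flushed and collection starts over.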
All right.
We've shown the complete model lifecycle
and how to apply domain-specific knowledge
from people within our organization, like project managers and business analysts;
manage that data through a taxonomy with InstructLab;
generate synthetic data that's then used for training that model;
and then ultimately deploy it into a Kubernetes-based platform like OpenShift,
utilizing AI services from tools like watsonx.ai,
and using technologies like RAG to enhance that experience.
Thank you so much for watching.