Domain‑Specific LLM Training with InstructLab
Key Points
- The traditional LLM pipeline relies on data engineers and data scientists to curate structured database inputs, which makes it hard to incorporate domain‑specific knowledge stored in unstructured formats.
- Tools like InstructLab let project managers and business analysts feed domain knowledge from documents (Word, PDFs, text files) into a git‑based taxonomy, eliminating the need for a dedicated data‑scientist step.
- InstructLab automatically generates synthetic variations of questions from the curated content, enriching training data and improving the model’s ability to understand diverse prompts.
- After training, the model can be deployed on Kubernetes‑based platforms such as Red Hat OpenShift, leveraging GPU/accelerator hardware from NVIDIA, AMD, or Intel.
- The OpenShift AI extension provides the MLOps stack for model serving, inference configuration, and metrics collection, completing the end‑to‑end lifecycle.
Sections
- [00:00:00](https://www.youtube.com/watch?v=0OOXGwLENyY&t=0s) Integrating Domain Knowledge with InstructLab - The speaker explains how to augment the traditional LLM data pipeline by involving project managers and analysts and using InstructLab’s git‑based taxonomy to incorporate non‑database documents into model training.
- [00:03:12](https://www.youtube.com/watch?v=0OOXGwLENyY&t=192s) Synthetic Data‑Driven LLM Deployment - The speaker explains how InstructLab automates synthetic data generation and model training, then deploys the LLM on Kubernetes/OpenShift using AI accelerators and Red Hat OpenShift AI to manage inference, metrics, and the full MLOps lifecycle with governance.
- [00:06:25](https://www.youtube.com/watch?v=0OOXGwLENyY&t=385s) Cost‑Effective RAG Model Pipeline - The segment outlines a budget‑aware workflow that runs data processing intermittently, leverages RAG and a taxonomy‑driven pipeline—including project manager and analyst input, synthetic data generation via InstructLab, and training with watsonx.ai—before deploying the model on an OpenShift/Kubernetes platform.
Source: [https://www.youtube.com/watch?v=0OOXGwLENyY](https://www.youtube.com/watch?v=0OOXGwLENyY)
Duration: 00:07:50
Full Transcript
Hey, everybody.
Today, I want to talk to you about how to apply domain specific knowledge to your LLM lifecycle.
A traditional approach starts with a data engineer,
who curates data that is then used by data scientists,
who ultimately take that data,
train the model on it,
and then make that model available for inference.
One of the challenges with this, though,
is that the data being used over here
typically lives in a traditional database of some sort, either SQL or NoSQL,
and it usually contains metrics,
sales data, or anything else that's typically organized or curated by an organization.
The challenge here is getting domain specific knowledge within
an organization and applying it to this same process.
Now let's look at the same approach,
but use a variety of tools that empower people like project managers
and business analysts to contribute to the process.
So here we have a project manager
and a business analyst.
They both have domain specific knowledge about processes
within their organization,
but these could be stored in Word documents or text files of some sort,
not the traditional data stores that we typically use within a model lifecycle,
but we can change this.
We can use a tool like InstructLab to manage this process.
InstructLab is an open source tool that allows the management of what we call a taxonomy.
This taxonomy is typically just a git repository where we can manage things like YAML files or text files
and then apply that to our model.
We could even have more traditional document formats like PDFs,
and have those transformed into the necessary file structure that InstructLab expects.
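To make the taxonomy step more concrete, here is a rough sketch of adding a knowledge entry as a `qna.yaml` file inside the taxonomy git repository. The directory path, domain, and all field values are hypothetical, and the YAML schema shown is a simplified approximation of InstructLab's knowledge format rather than a verbatim copy of the current spec:

```python
from pathlib import Path

# Hypothetical taxonomy checkout and knowledge domain path.
taxonomy_root = Path("taxonomy")
entry_dir = taxonomy_root / "knowledge" / "company" / "expense_policy"

# Simplified sketch of an InstructLab-style qna.yaml; consult the
# InstructLab docs for the authoritative schema.
QNA_YAML = """\
version: 3
created_by: example-business-analyst
seed_examples:
  - context: |
      Employees may claim travel expenses within 30 days of travel.
    questions_and_answers:
      - question: How long do employees have to claim travel expenses?
        answer: Travel expenses must be claimed within 30 days of travel.
document:
  repo: https://example.com/docs.git
  commit: abc123
  patterns:
    - expense-policy.md
"""

def write_entry() -> Path:
    """Write the qna.yaml into the taxonomy tree and return its path."""
    entry_dir.mkdir(parents=True, exist_ok=True)
    path = entry_dir / "qna.yaml"
    path.write_text(QNA_YAML, encoding="utf-8")
    return path

if __name__ == "__main__":
    print(write_entry())
```

In a real workflow this file would be committed to the taxonomy repository so InstructLab can pick it up on its next run.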
Once they've applied that data to the taxonomy that InstructLab manages,
we can then start the more traditional process that we saw earlier,
but we don't actually need a data scientist in this case.
InstructLab is handling all that,
and it will then create synthetic data
through this process.
Now, I know "synthetic data"
sounds kind of scary, but I want to approach it in a different way.
Think of synthetic data in this case as just another way of reframing the question:
instead of one way of asking a question,
we have many different ways of asking the same question.
This empowers the model, especially when we go through the training cycle,
giving the LLM more opportunities to accurately reply to your prompts.
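InstructLab itself uses a teacher model for synthetic data generation, but the idea of "many ways of asking the same question" can be illustrated with a toy, template-based paraphraser. The templates and seed question below are made up for illustration only:

```python
# Toy stand-in for synthetic data generation: reframe one seed question
# into several phrasings. Real systems (like InstructLab) use a teacher
# LLM for this; these templates are hypothetical.
TEMPLATES = [
    "{q}",
    "Can you tell me {q_lower}",
    "I was wondering: {q}",
    "Quick question: {q}",
]

def reframe(question: str) -> list[str]:
    """Return several phrasings of one seed question."""
    q_lower = question[0].lower() + question[1:]
    return [t.format(q=question, q_lower=q_lower) for t in TEMPLATES]

if __name__ == "__main__":
    for variant in reframe("How long do employees have to claim travel expenses?"):
        print(variant)
```

Each variant pairs with the same answer during training, which is what enriches the dataset.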
Once we've trained the model, we can then go ahead and deploy it into an AI platform.
This could be Kubernetes-based, like OpenShift, for example,
and it can take advantage of different AI accelerators
from vendors like NVIDIA,
AMD, or Intel, for example.
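To sketch what a deployment on such a platform might look like, here is a KServe-style `InferenceService` manifest built as a Python dict, requesting one NVIDIA GPU via the `nvidia.com/gpu` extended resource. The model name, storage URI, and model format are hypothetical placeholders, and this is one common serving convention rather than the specific setup described in the video:

```python
import json

def inference_service(name: str, storage_uri: str, gpus: int = 1) -> dict:
    """Build a hedged sketch of a KServe-style InferenceService manifest."""
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": {
            "predictor": {
                "model": {
                    # Hypothetical model format and storage location.
                    "modelFormat": {"name": "vLLM"},
                    "storageUri": storage_uri,
                    "resources": {
                        # Request GPU accelerators from the node.
                        "limits": {"nvidia.com/gpu": str(gpus)},
                    },
                },
            },
        },
    }

if __name__ == "__main__":
    manifest = inference_service("domain-llm", "s3://models/domain-llm")
    print(json.dumps(manifest, indent=2))
```

The same shape would apply with AMD or Intel accelerators, using their respective extended resource names.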
Now, this is the infrastructure layer,
but we want to be able to interact with our model, configure
the inference, apply metrics,
and do all those things that need to be part of the MLOps lifecycle.
Now we can do this
with an extension for OpenShift called Red Hat OpenShift AI,
which will provide you all those tools for managing the lifecycle of this model within production.
Now, you may want to then interact with that model:
validate it, apply governance,
or even just sandbox with it.
This could be done with something like watsonx.ai,
which can sit on top of OpenShift and interact with all the models being served within this AI stack.
Now, once this lifecycle has finished,
we can then restart the whole process again,
use the new data that has since been built up by our project managers and business analysts,
and go through this lifecycle once more.
But one thing to note is that this can be really costly.
We may not want to run this process over and over again every week;
we may only have the budget to run it once a month or once every other month.
Well, we can use technologies like RAG,
and have this data live over here in the interim before we go through this process again.
Once we do that, we can flush out our RAG database and then start anew
as data is collected by our project managers and our business analysts.
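The "collect in the interim, then flush after retraining" pattern can be sketched as a minimal in-memory RAG store. Real deployments would use a vector database and an embedding model; plain word-overlap scoring and all the names below are invented just to keep the sketch self-contained:

```python
class InterimRagStore:
    """Toy stand-in for the interim RAG database described above."""

    def __init__(self) -> None:
        self.docs: list[str] = []

    def add(self, doc: str) -> None:
        """Collect new domain knowledge between training runs."""
        self.docs.append(doc)

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        """Return the k docs sharing the most words with the query
        (a crude substitute for embedding similarity)."""
        q = set(query.lower().split())
        scored = sorted(
            self.docs,
            key=lambda d: len(q & set(d.lower().split())),
            reverse=True,
        )
        return scored[:k]

    def flush(self) -> None:
        """Empty the store once the knowledge is baked into a retrained model."""
        self.docs.clear()

if __name__ == "__main__":
    store = InterimRagStore()
    store.add("Travel expenses must be claimed within 30 days.")
    store.add("The office closes at 6 pm on Fridays.")
    print(store.retrieve("claim travel expenses", k=1))
    store.flush()  # after the next training run, start anew
    print(store.docs)  # → []
```

Between monthly training runs, new documents feed retrieval; after retraining, the store is flushed and collection starts over.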
All right.
We've shown the complete model lifecycle
and how to apply domain-specific knowledge
from people within our organization, like project managers and business analysts;
manage that data through a taxonomy with InstructLab;
generate synthetic data that's then used for training that model;
and then ultimately deploy it into a Kubernetes-based platform like OpenShift,
utilizing AI services from tools like watsonx.ai,
and using technologies like RAG to enhance that experience.
Thank you so much for watching.