# Framework for Selecting Foundation Models

**Source:** [https://www.youtube.com/watch?v=pePAAGfh-IU](https://www.youtube.com/watch?v=pePAAGfh-IU)
**Duration:** 00:07:54

## Key Points

- Selecting a foundation model requires balancing factors like training data, parameter count, bias risks, and hallucination potential rather than simply opting for the largest model.
- A practical six-stage AI model selection framework involves (1) defining the use case, (2) listing available model options, (3) gathering each model's size, performance, cost, and risk metrics, (4) evaluating those characteristics against the use case, (5) testing candidates, and (6) choosing the model that delivers the greatest value.
- In the example of generating personalized marketing emails, the organization narrows its choices to two existing models, Meta's Llama 2-70B and IBM's Granite-13B, and assesses them based on model cards, fine-tuning relevance, and known performance for text generation.
- By comparing size, cost, deployment complexity, and risk profiles, and then running targeted tests, the team can select the model that best meets accuracy, efficiency, and business-value requirements for the specific email-writing task.

## Sections

- [00:00:00](https://www.youtube.com/watch?v=pePAAGfh-IU&t=0s) **Framework for Selecting Foundation Models** - The speaker outlines a six-stage process for choosing the appropriate generative AI model based on the specific use case, model characteristics, costs, risks, and testing.
- [00:03:11](https://www.youtube.com/watch?v=pePAAGfh-IU&t=191s) **Factors for Model Evaluation** - The passage explains how fine-tuned or zero-shot use of pre-trained foundation models impacts performance and outlines three key evaluation criteria: accuracy, reliability (including consistency, explainability, trustworthiness, and toxicity avoidance), and speed.
- [00:06:16](https://www.youtube.com/watch?v=pePAAGfh-IU&t=376s) **Balancing Cloud, On-Prem, Multi-Model Strategy** - The passage contrasts running an open-source Llama 2 model on public cloud versus fine-tuning it on-premises, emphasizing security, cost, and compute trade-offs, and proposes a framework for pairing different foundation models with varied enterprise use cases.

## Full Transcript
If you have a use case for generative AI,
how do you decide on which foundation model to pick to run it?
With the huge number of foundation models out there,
it's not an easy question.
Different models are trained on different data and have different parameter counts,
and picking the wrong model can have severe unwanted impact,
like biases originating from the training data or hallucinations that are just plain wrong.
Now, one approach is to just pick the largest,
most massive model out there to execute every task.
The largest models have huge parameter counts
and are usually pretty good generalists, but with large models come costs,
costs of compute, costs of complexity, and costs of variability.
So often the better approach is to pick the right size model for the specific use case you have.
So let me propose to you an AI model selection framework.
It has six pretty simple stages.
Let's take a look at what they are and then give some examples of how this might work.
Now, stage one, that is to clearly articulate your use case.
What exactly are you planning to use generative AI for?
From there you'll list some of the model options available to you.
Perhaps there are already a subset of foundation models running that you have access to.
With a short list of models you'll next want to identify each model's size,
performance, costs, risks, and deployment methods.
Next, evaluate those model characteristics for your specific use case.
Run some tests.
That's the next stage,
testing options based on your previously identified use case and deployment needs.
And then finally, choose the option that provides the most value.
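To make those stages a little more concrete, here is a minimal sketch in Python of how a team might record candidates and run the selection. The class and function names are hypothetical, not anything from the video, and `score_fn` stands in for whatever evaluation and testing you actually run.

```python
from dataclasses import dataclass, field

@dataclass
class CandidateModel:
    """Characteristics gathered in stage three for one foundation model."""
    name: str
    parameters_b: float           # size in billions of parameters
    cost_per_1k_tokens: float     # rough inference cost estimate
    deployment: str               # e.g. "public cloud" or "on-prem"
    risk_notes: list = field(default_factory=list)

def select_model(use_case, candidates, score_fn):
    """Stages four to six: evaluate and test each candidate, pick the most valuable one.

    score_fn(use_case, model) is a placeholder for your own evaluation
    and testing against your prompts and metrics.
    """
    return max(candidates, key=lambda model: score_fn(use_case, model))
```

In the example that follows, the two candidates would simply be two `CandidateModel` entries, one for Llama 2 70B and one for Granite 13B.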
So let's put this framework to the test.
Now, my use case, we're going to say that is a use case for text generation.
I need the AI to write personalized emails for my awesome marketing campaign.
That's stage one.
Now, my organization is already using two foundation models for other things,
so I'll evaluate those.
First of all, we've got Llama 2
and specifically the Llama 2 70B model, a fairly large model with 70 billion parameters.
It's from Meta, and I know it's quite good at some text generation use cases.
Then there's also Granite that we have deployed.
Granite is a smaller general purpose model and that's from IBM.
And I know there is a 13 billion parameter model
that I've heard does quite well with text generation as well.
So those are the models I'm going to evaluate, Llama 2 and Granite.
Next, we need to evaluate model size, performance, and risks.
And a good place to start here is with the model card.
The model cards might tell us if the model has been
trained on data specifically for our purposes.
Pre-trained foundation models are fine-tuned for specific use cases
such as sentiment analysis or document summarization or maybe text generation.
And that's important to know because if a model is pre-trained
on a use case close to ours, it may perform better when processing our prompts
and enable us to use zero-shot prompting to obtain our desired results.
And that means we can simply ask the model to perform tasks
without having to provide multiple completed examples first.
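As an illustration of that distinction, here is a minimal sketch of a zero-shot prompt versus a few-shot prompt for the email use case; the prompt text is invented purely for illustration.

```python
# Zero-shot: we simply describe the task; no completed examples are provided.
zero_shot_prompt = (
    "Write a short, friendly marketing email inviting the customer "
    "to our spring sale. Customer name: Ada. Product: running shoes."
)

# Few-shot: we prepend completed examples so a model that was NOT
# pre-trained on a similar use case can infer the expected output format.
few_shot_prompt = (
    "Example input: Customer name: Sam. Product: headphones.\n"
    "Example output: Hi Sam, our new headphones just arrived...\n\n"
    "Now write one for: Customer name: Ada. Product: running shoes."
)
```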
Now, when it comes to evaluating model performance for our use case, we can consider three factors.
The first factor that we would consider is accuracy.
Now, accuracy denotes how close the generated output is to the
desired output, and it can be measured objectively and repeatedly
by choosing evaluation metrics that are relevant to your use cases.
So for example, if your use case relates to text translation,
the BLEU benchmark - that's the Bilingual Evaluation Understudy -
can be used to indicate the quality of the generated translations.
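For instance, if reference translations were available, a BLEU score could be computed with an off-the-shelf library such as sacrebleu; the sentences below are placeholders, not real evaluation data.

```python
import sacrebleu  # pip install sacrebleu

# Model outputs and human reference translations (placeholder examples).
hypotheses = ["the cat sat on the mat"]
references = [["the cat is sitting on the mat"]]  # one list per reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}")  # higher means closer to the references
```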
Now the second factor relates to the reliability of the model.
Now that's a function of several factors actually, such as consistency,
explainability and trustworthiness,
as well as how well a model avoids toxicity like hate speech.
Reliability comes down to trust,
and trust is built through transparency and traceability of the training data
and accuracy and reliability of the output.
And then the third factor that is speed.
And specifically we're saying
how quickly does a user get a response to a submitted prompt?
Now, speed and accuracy are often a trade off here.
Larger models may be slower, but perhaps deliver a more accurate answer.
Or then again, maybe the smaller model is faster
and has minimal differences in accuracy to the larger model.
It really comes down to finding the sweet spot between performance, speed and cost.
A smaller, less expensive model may not offer
performance or accuracy metrics on par with an expensive one,
but it may still be preferable once you consider the additional benefits
the model can deliver, like lower latency
and greater transparency into the model's inputs and outputs.
The way to find out is to simply select the model that's likely
to deliver the desired output and well, test it.
Test that model with your prompts to see if it works,
and then assess the model's performance and the quality of its output using metrics.
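A minimal sketch of that test loop is below. It assumes you supply your own `generate_fn` wrapper around whichever inference endpoint you actually call, and a `quality_fn` that implements your chosen accuracy metric (BLEU, a rubric, human review); both names are hypothetical.

```python
import time

def evaluate_candidate(model_name, prompts, generate_fn, quality_fn):
    """Send the same prompts to one candidate model and record speed and quality.

    generate_fn(model_name, prompt) wraps your inference endpoint;
    quality_fn(prompt, output) scores an output with your own metric.
    """
    latencies, scores = [], []
    for prompt in prompts:
        start = time.perf_counter()
        output = generate_fn(model_name, prompt)
        latencies.append(time.perf_counter() - start)
        scores.append(quality_fn(prompt, output))
    return {
        "model": model_name,
        "avg_latency_s": sum(latencies) / len(latencies),
        "avg_quality": sum(scores) / len(scores),
    }
```

Running this for each shortlisted model with the same prompt set gives you comparable speed and quality numbers to weigh against cost.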
Now, I've mentioned deployment in passing, so a quick word on that.
As a decision factor, we need to evaluate where and how we want the model and data to be deployed.
So let's say that we're leaning towards Llama 2
as our chosen model based on our testing.
Right, cool. Llama 2.
That's an open source model and we could inference with it on a public cloud.
So we've got a public cloud already out here.
It gives us some choice, though a limited one: we can simply run inference against the model there.
But if we decide we want to fine tune the model with our own enterprise data,
we might need to deploy it on prem.
So this is where we have our own version of Llama 2,
and we are going to fine-tune it.
Now, deploying on premises gives you greater control
and more security benefits compared to a public cloud environment.
But it's an expensive proposition,
especially when factoring model size and compute power,
including the number of GPUs it takes to run a single large language model.
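As a rough illustration of the on-prem option, here is a minimal sketch of loading an open model for local inference with the Hugging Face transformers library. It assumes you have been granted access to the gated Llama 2 weights and have enough GPU memory; the 70B variant in particular needs multiple GPUs.

```python
# pip install transformers accelerate torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-70b-chat-hf"  # gated: requires accepting Meta's license

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # shard the weights across the available GPUs
    torch_dtype="auto",
)

prompt = "Write a short marketing email about running shoes."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```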
Now, everything we've discussed here is tied to a specific use case,
but of course it's quite likely that any given organization will have multiple use cases.
And as we run through this model selection framework,
we might find that each use case is better suited to a different foundation model.
That's called a multi-model approach.
Essentially, not all AI models are the same, and neither are your use cases.
And this framework might be just what you need to pair the models
and the use cases together to find a winning combination of both.
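As a closing illustration, a multi-model approach can be as simple as a routing table that pairs each use case with the model that tested best for it; the pairings below are illustrative, not recommendations.

```python
# Illustrative routing table: each use case points at the model that won
# the selection framework for that task, plus where it is deployed.
MODEL_ROUTING = {
    "marketing_email_generation": {"model": "llama-2-70b", "deployment": "public cloud"},
    "customer_sentiment_analysis": {"model": "granite-13b", "deployment": "on-prem"},
    "document_summarization":     {"model": "granite-13b", "deployment": "on-prem"},
}

def model_for(use_case):
    """Look up the chosen model; an unknown use case needs a new run of the framework."""
    return MODEL_ROUTING[use_case]
```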