Running LLMs Locally with Ollama
Key Points
- Ollama lets you run open‑source large language models locally, eliminating reliance on external cloud services and reducing AI‑related costs.
- By using a single CLI command (e.g., `ollama run`), you can download, launch, and interact with optimized, quantized models directly from your terminal on Windows, macOS, or Linux.
- Running models locally ensures that all data stays within your secure environment, which is crucial for organizations that need to protect customer information.
- Ollama provides a curated catalog of ready‑to‑use language, multi‑modal, embedding, and tool‑calling models (such as Llama, Mistral, Granite, DeepSeek), simplifying deployment and integration into your applications.
Sections
- Running LLMs Locally with Ollama - The speaker introduces Ollama, an open‑source tool that lets developers run and deploy quantized large language models on their own machines, cutting cloud costs and keeping data private.
- Choosing Multi‑Modal, Embedding, and Tool‑Calling Models - An overview of multimodal, embedding, and tool‑calling AI models, guidance on selecting suitable LLMs like LLaMA, IBM Granite, or reasoning models, and a note on using Ollama model files for streamlined deployment.
- Running Open‑Source LLMs Locally with Ollama - The speaker highlights Ollama as a convenient way for developers to host and manage large language models on‑premise, saving cloud costs, protecting sensitive data, and supporting limited‑connectivity environments, while noting it’s one of several possible solutions.
Full Transcript
# Running LLMs Locally with Ollama

**Source:** [https://www.youtube.com/watch?v=5RIOQuHOihY](https://www.youtube.com/watch?v=5RIOQuHOihY)
**Duration:** 00:07:01
By now, you've probably tried out some really helpful AI models
to summarize your data, to act as a pair programmer, and much more.
But traditionally, this meant using cloud services.
Perhaps you were using an LLM through some type of chatbot, or for me, an API that was hosted on a cloud service.
But at the end of the day, I was using someone else's cloud computing resources.
Now, what if I told you, though, that there's an open source way
to run AI models and LLMs locally from your own machine?
That allows you to save on costs for your AI bills,
to keep your data private, and, as a developer, to build out applications and features that use AI from your own machine.
That's right, today we're gonna be taking a look at Ollama to run large language models, or LLMs, from your own machine.
Now, this is really, really cool because you're able to take a model
which might be quantized or compressed and run it from your own system resources
and integrate with a huge ecosystem of models from Llama to Mistral and beyond.
And for organizations that are looking to use AI in their applications, it can be very helpful
because we're able to take and deploy a small or large language model
locally and ensure that customer data doesn't leave the secure environment at all.
And this is all using a simple command that we're going to take a look at here in a second.
But how does it work and what should you know about using Ollama?
Well, let's begin with the Ollama CLI.
Now whether you're on Windows, Mac or Linux,
you can head over to ollama.com in order to download the CLI or command line interface for your machine.
Now this allows you to download models, to run them, and to interact with them all from your own terminal.
While in the past you had to go over to repositories such as Hugging Face
in order to download model weights,
and you had to work with a complicated setup in order to get the model ready to be inferenced and chatted with,
this is all simplified with Ollama through a single command: `ollama run`.
After that, we'll pass in the argument or parameter of the model name.
It could be Granite, could be Llama, could be DeepSeek, and that'll kick off the process
to download one of Ollama's own models,
their compressed and optimized models, to your machine, and start up an inference server,
similar to how you would start up a web server and serve your web pages.
This will drop you into a GPT-style chat window,
and you can almost think of the ollama run command as a package manager for AI,
allowing you to run and manage models with a single command.
Now, that's awesome. But speaking of models,
Ollama has a catalog of standardized and customizable language, multi-modal, embedding, and tool-calling models.
Now, I just said a lot of things, but let's break it down here.
The first one that I wanted to mention is a type of model for language.
So for example, that means working with your text and data,
either in a conversational or base format, or an instructional format for answering questions.
Now, the second type is multi-modal models.
So for example,
working with images and being able to analyze, hey, what's going on in this specific frame?
Now, the next one is a type of model called embedding,
which is essentially taking our data from PDFs and other types of data formats
and preparing it to be used in a vector database to do question and answering on our own unique data.
And then finally, the last type of model that you can use is tool calling.
So, it's a fine-tuned version of a language model
that is familiar with calling different functions, APIs, and services in an agentic way.
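As a rough sketch of the embedding case described above, here's what a request to Ollama's embeddings endpoint can look like, using only the Python standard library (this assumes a local Ollama server is running and an embedding model has been pulled; `nomic-embed-text` is an example model name, not one mentioned in the video):

```python
import json
import urllib.request

# Ollama's embeddings endpoint on its default local port.
# "nomic-embed-text" is an assumed example embedding model.
payload = {"model": "nomic-embed-text", "prompt": "What is a vector database?"}
request = urllib.request.Request(
    "http://localhost:11434/api/embeddings",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

try:
    with urllib.request.urlopen(request, timeout=30) as response:
        vector = json.loads(response.read())["embedding"]
        print(f"Embedding with {len(vector)} dimensions")
except OSError:
    # No local server reachable -- the request shape above is the point.
    vector = None
```

The resulting vector is what you would store in a vector database for question answering over your own data.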
Now, these are some of the types of models, but how would you pick the right model for your use case?
Well, it's a good question, and it depends on your project's requirements.
But some of the popular models that we see being used a lot in the community are, for example, the Llama series of models:
different open and fine-tuned models for various use cases that provide support for different languages,
as well as IBM's Granite model.
So, the Granite model is an enterprise-ready LLM that can be used with RAG or agentic functionality.
And we also see these types of reasoning models that are being
more and more popular that essentially provide chain of thought or thinking capabilities to answer your questions.
Now, beyond just using the model catalog, you can actually take advantage of what's known
as the Ollama model file. Essentially, just as Docker has abstracted away the complexities of containers,
we're using a model file to abstract the complexities of models: to be able to import one
from Hugging Face, for example,
or to start from a model that you already have
and customize it with system prompts and different parameters to be the best model for your use cases.
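For illustration, a minimal model file might look like the following sketch (the base model name and system prompt are assumptions for this example, not from the video; the syntax follows Ollama's Modelfile conventions):

```
# Start from a model already available to Ollama (illustrative base model):
FROM llama3

# Bake a system prompt and a sampling parameter into the custom model:
SYSTEM "You are a concise assistant for answering internal support questions."
PARAMETER temperature 0.3
```

You would then build and run it with something like `ollama create my-assistant -f Modelfile` followed by `ollama run my-assistant`.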
But no matter what type of model that you want to use,
your request at the end of the day will be passed through the Ollama server, which is running on localhost on port 11434.
So for example, let's say you're making a request or prompt to the model from your CLI, from your terminal with Ollama.
Well, this is actually being passed to the Ollama model server,
and in a similar way for applications that might want to use models,
for example with LangChain or another framework, you make a POST request to the model that's running on localhost
on this specific port, which is a REST server.
So it has endpoints, and we can make that request
similar to how we would make a request to any other service that's running on our machine.
And right here is the simplicity of Ollama for developers.
It lifts the weight of having to run the model in your application, and it abstracts the model as an API.
So you make that request, and you get that response back, all locally on your machine.
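As a minimal sketch of that request shape using only the Python standard library (it assumes a local Ollama server is running and that a model has been pulled; `granite3.3` is an example model name, not something specified in the video):

```python
import json
import urllib.request

# Ollama's generate endpoint on its default local port.
# "granite3.3" is an assumed example; use any model you've pulled.
url = "http://localhost:11434/api/generate"
payload = {
    "model": "granite3.3",
    "prompt": "In one sentence, what does quantization do to a model?",
    "stream": False,  # ask for a single JSON reply instead of a token stream
}

request = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

try:
    with urllib.request.urlopen(request, timeout=60) as response:
        print(json.loads(response.read())["response"])
except OSError:
    # No server running locally -- the request shape above is the point.
    print("Ollama server not reachable on localhost:11434")
```

A framework such as LangChain ends up issuing essentially this same POST under the hood.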
Or let's say you want to run Ollama on another machine and you just SSH there, or you make that request remotely.
No matter what you do though, Ollama is doing the heavy lifting of running the model.
And you can even connect other interfaces, such as Open Web UI,
in order to set up a simple RAG pipeline and use your PDFs or other documents
to be passed with the context of that information to Ollama and get that response back.
But that is the Ollama server.
Now,
whether you want to save on cloud costs when you're using AI models, or have
private, sensitive data that can't leave your premises, or even
limited internet access in an IoT device,
you can use Ollama and the power of open source AI to use and manage LLMs from your local machine.
Now, it isn't the only tool for doing this, but as a developer, it's made my life much easier, and I encourage you to check it out.
As always, thank you so much for joining us today.
Make sure to like the video if you learned something.
and have a good one!