Learning Library


Running LLMs Locally with Ollama

Key Points

  • Ollama lets you run open‑source large language models locally, eliminating reliance on external cloud services and reducing AI‑related costs.
  • By using a single CLI command (e.g., `ollama run <model>`), you can download, launch, and interact with optimized, quantized models directly from your terminal on Windows, macOS, or Linux.
  • Running models locally ensures that all data stays within your secure environment, which is crucial for organizations that need to protect customer information.
  • Ollama provides a curated catalog of ready‑to‑use language, multi‑modal, embedding, and tool‑calling models (such as Llama, Mistral, Granite, DeepSeek), simplifying deployment and integration into your applications.

Full Transcript

# Running LLMs Locally with Ollama

**Source:** [https://www.youtube.com/watch?v=5RIOQuHOihY](https://www.youtube.com/watch?v=5RIOQuHOihY)
**Duration:** 00:07:01

## Sections

- [00:00:00](https://www.youtube.com/watch?v=5RIOQuHOihY&t=0s) **Running LLMs Locally with Ollama** - The speaker introduces Ollama, an open‑source tool that lets developers run and deploy quantized large language models on their own machines, cutting cloud costs and keeping data private.
- [00:03:11](https://www.youtube.com/watch?v=5RIOQuHOihY&t=191s) **Choosing Multi‑Modal, Embedding, and Tool‑Calling Models** - An overview of multi‑modal, embedding, and tool‑calling AI models, guidance on selecting suitable LLMs like Llama, IBM Granite, or reasoning models, and a note on using Ollama model files for streamlined deployment.
- [00:06:27](https://www.youtube.com/watch?v=5RIOQuHOihY&t=387s) **Running Open‑Source LLMs Locally with Ollama** - The speaker highlights Ollama as a convenient way for developers to host and manage large language models on‑premise, saving cloud costs, protecting sensitive data, and supporting limited‑connectivity environments, while noting it's one of several possible solutions.

## Full Transcript
0:00 By now, you've probably tried out some really helpful AI models to summarize your data, to act as a pair programmer, and much more. But traditionally, this meant using cloud services. Perhaps you were using an LLM through some type of chatbot, or, for me, an API that was hosted on a cloud service. But at the end of the day, I was using someone else's cloud computing resources.

0:23 Now, what if I told you that there's an open‑source way to run AI models and LLMs locally from your own machine? That allows you to save on cost for your AI bills, to keep your data private, and, as a developer, to build out applications and features that use AI from your own machine. That's right: today we're going to be taking a look at Ollama to run large language models, or LLMs, from your own machine.

0:53 Now, this is really cool, because you're able to take a model, which might be quantized or compressed, run it from your own system resources, and integrate with a huge ecosystem of models from Llama to Mistral and beyond. And for organizations that are looking to use AI in their applications, it can be very helpful, because we're able to take and deploy a small or large language model locally and ensure that customer data doesn't leave the secure environment at all. And this is all using a simple command that we're going to take a look at in a second.

1:27 But how does it work, and what should you know about using Ollama? Well, let's begin with the Ollama CLI. Whether you're on Windows, macOS, or Linux, you can head over to ollama.com to download the CLI, or command‑line interface, for your machine. This allows you to download models, to run them, and to interact with them, all from your own terminal.
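The CLI workflow described here can be sketched with a few commands. This is a minimal sketch; the model names are illustrative, and any model from the Ollama catalog works the same way:

```shell
# First run: downloads the quantized model weights, starts the local
# inference server, and drops you into an interactive chat.
ollama run granite3.2

# Package-manager-style housekeeping commands:
ollama pull llama3.2    # download a model without starting a chat
ollama list             # show the models already on disk
ollama rm llama3.2      # delete a model to reclaim disk space
```

A subsequent `ollama run` of the same model reuses the local copy instead of downloading it again.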
1:51 While in the past you had to go over to repositories such as Hugging Face to download model weights, and you had to work through a complicated setup to get the model ready to be inferenced and chatted with, this is all simplified with Ollama through a single command: `ollama run`. After that, we'll pass in the argument, or parameter, of the model name. It could be Granite, could be Llama, could be DeepSeek, and that'll kick off the process to download one of Ollama's own compressed and optimized models to your machine and start up an inference server, similar to how you would start up a web server and serve your web pages.

2:31 This will drop you into a GPT‑looking chat window, and you can almost think of the `ollama run` command as a package manager for AI, allowing you to run and manage models with a single command.

2:43 Now, that's awesome, but speaking of models, Ollama has a catalog of standardized and customizable language, multi‑modal, embedding, and tool‑calling models. Now, I just said a lot of things, but let's break it down here.

2:57 The first one that I wanted to mention is a type of model for language. So, for example, that means working with your text and data, either in a conversational or base type of format, or an instructional format for answering questions. Now, the second type is multi‑modal models: for example, working with images and being able to analyze, hey, what's going on in this specific frame? Now, the next one is a type of model called embedding, which is essentially taking our data from PDFs and other types of data formats and preparing it to be used in a vector database, to do question answering on our own unique data. And then finally, the last type of model that you can use is tool calling. So, it's a fine‑tuned version of a language model
3:43 that is familiar with calling different functions, APIs, and services in an agentic way.

3:48 Now, these are some of the types of models, but how would you pick the right model for your use case? Well, it's a good question, and it depends on your project's requirements. But some of the popular models that we see being used a lot in the community are, for example, the Llama series of models: different open and fine‑tuned models for various use cases that provide support for different languages. There is also IBM's Granite model, an enterprise‑ready LLM that can be used with RAG or agentic functionality. And we also see reasoning models becoming more and more popular, which essentially provide chain‑of‑thought, or thinking, capabilities to answer your questions.

4:34 Now, beyond just using the model catalog, you can actually take advantage of what's known as the Ollama model file. Essentially, just as Docker has abstracted the complexities of containers, we're using a model file to abstract the complexities of models: to be able to import a model from Hugging Face, for example, or to start from a model that you already have and customize it with system prompts and different parameters to be the best model for your use cases.

5:02 But no matter what type of model you want to use, your request at the end of the day will be passed through the Ollama server, which is running on localhost on port 11434. So, for example, let's say you're making a request, or prompt, to the model from your CLI, from your terminal, with Ollama. Well, this is actually being passed to the Ollama model server. And in a similar way, applications that might want to use models, for example with LangChain or another framework, make a POST request to the model that's running on localhost on this specific port, which is a REST server.
5:39 So it has endpoints, and we can make that request similar to how we would make a request to any other service that's running on our machine. And right here is the simplicity of Ollama for developers: it lifts the weight of having to run the model in your application, and it abstracts the model as an API. So you make that request, and you get that response back, all locally on your machine. Or let's say you want to run Ollama on another machine: you just SSH there, or you make that request remotely. No matter what you do, though, Ollama is doing the heavy lifting of running the model. And you can even connect other interfaces, such as Open WebUI, in order to set up a simple RAG pipeline and use your PDFs or other documents, passing the context of that information to Ollama and getting that response back.

6:28 But that is the Ollama server. Whether you want to save on cloud costs when you're using AI models, have private, sensitive data that can't leave your premises, or even have limited internet access on an IoT device, you can use Ollama and the power of open‑source AI to use and manage LLMs from your local machine. Now, it isn't the only tool for doing this, but as a developer, it's made my life much easier, and I encourage you to check it out.

6:55 As always, thank you so much for joining us today. Make sure to like the video if you learned something, and have a good one!
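The REST interaction described in the transcript can be sketched in a few lines of Python. This is a minimal sketch, not an official client: it assumes an Ollama server is running on its default port (11434) and that a model has already been pulled (the model name in the comment is illustrative). The `/api/generate` endpoint and the `model`/`prompt`/`stream` fields come from Ollama's REST API; setting `stream` to false asks for one complete JSON response instead of a token stream.

```python
import json
import urllib.request

# Default address of the local Ollama REST server.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(model: str, prompt: str) -> dict:
    """Request body for a single, non-streaming completion."""
    # "stream": False asks the server for one complete JSON response
    # instead of a stream of partial tokens.
    return {"model": model, "prompt": prompt, "stream": False}

def ask_ollama(model: str, prompt: str) -> str:
    """POST a prompt to the local Ollama server and return the reply text."""
    data = json.dumps(build_payload(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# With the server running and a model pulled (name is illustrative):
#   print(ask_ollama("granite3.2", "Explain RAG in one sentence."))
```

Because the model sits behind a plain HTTP endpoint, the same call works from a framework like LangChain, from Open WebUI, or over SSH to a remote machine; only the host in the URL changes.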