Running Ollama: Local LLMs on Laptop
Key Points
- Running large language models locally on your laptop eliminates cloud dependencies, ensuring full data privacy and giving developers direct control over AI resources.
- Ollama provides a cross‑platform command‑line tool that lets you download, install, and serve quantized LLMs (e.g., from its model store) on macOS, Windows, or Linux.
- The `ollama run` command both pulls the chosen model (like granite‑3.1‑dense) and starts a local inference server, exposing a standard API for chat and programmatic requests.
- Local execution uses optimized back‑ends such as llama‑cpp, enabling even limited hardware to run compressed models efficiently.
- The granite‑3.1 model showcased supports 11 languages, excels at enterprise tasks, and offers strong Retrieval‑Augmented Generation (RAG) capabilities for integrating proprietary data.
Sections
- Running Local LLMs with Ollama - The speaker introduces Ollama, an open‑source developer tool that lets you install and run large language models locally for privacy‑preserving, cloud‑free AI capabilities such as chat, code assistance, and RAG.
- Integrating Local LLM via LangChain - The speaker demonstrates connecting a locally hosted Ollama model to a Java/Quarkus application using LangChain for Java, enabling standardized API calls to automate insurance claim processing.
Source: [https://www.youtube.com/watch?v=uxE8FFiu_UQ](https://www.youtube.com/watch?v=uxE8FFiu_UQ)
Duration: 00:05:47
- [00:00:00](https://www.youtube.com/watch?v=uxE8FFiu_UQ&t=0s) Running Local LLMs with Ollama
- [00:03:13](https://www.youtube.com/watch?v=uxE8FFiu_UQ&t=193s) Integrating Local LLM via LangChain
Full Transcript
Hey, quick question.
Did you know that you can run the latest large language models locally on your laptop?
This means you don't have any dependencies on cloud services
and you get full data privacy while using optimized models to chat,
use code assistants, and integrate AI into your applications with RAG or even agentic behavior.
So today we're taking a look at Ollama.
It's a developer tool that has been quickly growing in popularity
and we're gonna show you how you can start using it on your machine right now,
but real quick, before we start installing things, what value does this open source project provide to you?
Well, as a developer, traditionally I'd need to request computing resources
or hardware to run something as intensive as a large language model.
And to use cloud services involves sending my data to somebody else, which might not always be feasible.
So by running models from my local machine, I can maintain full control over my AI and use a model through an API,
just like I would with another service, like a database on my own system.
Let's see this in action by switching over to my laptop and heading to ollama.com,
and this is where you can install the command line tool for Mac,
Windows, and of course, Linux, but also browse the repository of models.
For example, foundation models from the leading AI labs, but also
more fine-tuned or task-specific models, such as code assistants.
Which one should you use?
Well, we'll take a look at that soon,
but for now, I'll open up my terminal where Ollama has been installed,
and the first step is downloading and chatting with a model locally.
So now I have Ollama set up on my local machine.
And what we're going to do first is use the Ollama run command, which is almost two commands in one.
What's going to happen is it's going to pull the model from Ollama's model store,
if we don't already have it, and also start up an inference server for us
to make requests to the LLM that's running on our own machine.
So let's go ahead and do that now.
We're going to run `ollama run granite3.1-dense`,
and so while we have a chat interface here where we could ask questions, behind the scenes,
what we've done is downloaded a quantized
or compressed version of a model that's capable of running on limited hardware,
and we're also using a back end like llama.cpp to run the model.
So every time that we chat with the model, for example asking it about vim versus emacs,
what's happening is we're getting our response, but we're also making a POST request to the API that's running on localhost.
Pretty cool, right?
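That same localhost request can be sketched in plain Java, no framework needed. This is a minimal sketch assuming Ollama's default port 11434 and its `/api/generate` endpoint; the model tag and prompt are the ones from the video, and the actual `send` call is left commented out since it needs Ollama running:

```java
// Sketch of a direct call to the local Ollama HTTP API
// (assumes the default port 11434 and the /api/generate endpoint; Java 11+).
import java.net.URI;
import java.net.http.HttpRequest;

public class OllamaCall {
    static final String OLLAMA_URL = "http://localhost:11434/api/generate";

    // Build the JSON body for a non-streaming generate request.
    static String buildBody(String model, String prompt) {
        return "{\"model\":\"" + model + "\","
             + "\"prompt\":\"" + prompt + "\","
             + "\"stream\":false}";
    }

    public static void main(String[] args) {
        String body = buildBody("granite3.1-dense", "vim or emacs?");
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(OLLAMA_URL))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        System.out.println(request.method() + " " + request.uri());
        // prints: POST http://localhost:11434/api/generate
        // With Ollama running locally, send it with:
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

The same endpoint is what the chat interface is hitting behind the scenes, which is why any HTTP-capable language can talk to the model.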
So for our example, I ran the granite 3.1 model, and it has a lot of features that are quite interesting to me as a developer.
So it supports 11 different languages so it could translate between Spanish and English and back and forth,
and it's also optimized for enterprise specific tasks.
This includes high benchmarks on RAG capabilities,
and RAG allows us to use our unique data with the LLM by providing it in the context window of our queries,
but also capabilities for agentic behavior and much more,
but as always, it's good to keep your options open.
The Ollama model catalog is quite impressive with models for embedding, vision, tools, and many more,
but you could also import your own fine-tuned models, for example,
or use them from Hugging Face by using what's known as an Ollama Modelfile.
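For reference, a Modelfile is a short plain-text recipe. This one is purely illustrative (the file name, parameter value, and system prompt are assumptions, not from the video), using the real `FROM`, `PARAMETER`, and `SYSTEM` directives:

```
# Modelfile — hypothetical import of a local GGUF model
FROM ./my-fine-tuned-model.gguf
PARAMETER temperature 0.2
SYSTEM You are a concise assistant for insurance claims.
```

You would then register it with something like `ollama create my-model -f Modelfile` and run it like any other model in the catalog.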
So we've installed Ollama, we've chatted with the model running locally, and we've explored the model ecosystem,
but there's a big question left,
what about integrating an LLM like this into our existing application?
So let me hop out of the chat window and let's make sure that the model is running locally on our system.
So `ollama ps` can show us the running models,
and now that we have a model running on localhost,
our application needs a way to communicate with this model in a standardized format.
That's where we're going to be using what's known as LangChain,
and specifically LangChain4j, LangChain for Java, in our application,
which is a framework that's grown in popularity and allows us to use
a standardized API to make calls to the model from our application that's written in Java.
Now, we're going to be using Quarkus, which is a Kubernetes-optimized Java framework
that supports the LangChain4j extension in order to call our model from the application.
Let's get started.
So let's take a look at the application that we're currently working on.
So I'll open it up here in the browser.
Now, what's happening is that this fictitious organization Parasol
is being overwhelmed by new insurance claims
and could use the help of an AI, like a large language model,
to help process this overwhelming amount of information and make better and quicker decisions,
but how do we do that behind the scenes?
So here in our project, we've added LangChain4j as a dependency, and we're going to specify
the URL as localhost on
port 11434 in our application.properties, pointing to where our model is running on our machine.
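That configuration might look roughly like the fragment below. The property names follow the quarkus-langchain4j-ollama extension's convention as best I recall, so treat them as assumptions to verify against the extension's documentation:

```
# application.properties — point the extension at the local Ollama server
quarkus.langchain4j.ollama.base-url=http://localhost:11434
# model tag as pulled via `ollama run`
quarkus.langchain4j.ollama.chat-model.model-id=granite3.1-dense
```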
Now we're also gonna be using a WebSocket for the chat, which behind the scenes makes a POST request to the model,
and now our agents have AI capabilities, specifically a helpful assistant that can work with them to complete their job tasks.
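In code, that helpful assistant is typically declared as an AI service interface. This is a sketch using the quarkus-langchain4j annotations (it needs the extension on the classpath to compile); the interface name, method name, and prompt text are illustrative, not taken from the video:

```java
// Sketch of a LangChain4j AI service in Quarkus
// (requires the quarkus-langchain4j-ollama extension; names are illustrative).
import dev.langchain4j.service.SystemMessage;
import dev.langchain4j.service.UserMessage;
import io.quarkiverse.langchain4j.RegisterAiService;

@RegisterAiService
public interface ClaimAssistant {

    @SystemMessage("You are a helpful assistant for insurance claim agents.")
    @UserMessage("Summarize the key details of this claim: {claim}")
    String summarizeClaim(String claim);
}
```

Injecting `ClaimAssistant` anywhere in the application then gives a plain Java method call that the framework turns into a request to the locally running model.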
So let's ask the model to summarize the claim details.
And there we go.
In the form of streamed tokens, we've made that request to the model
running with Ollama on our local machine, and we're able to quickly prototype from our laptop.
It's just as simple as that.
So running AI locally can be really handy when it comes to prototyping, proof of concepts and much more,
and another common use case is code assistance,
connecting a locally running model to your IDE instead of using paid services.
When it comes to production, however, you might need more advanced capabilities, but for getting started today,
Ollama is a great pick for developers.
So what are you working on or interested in?
Let us know in the comments below,
but thanks as always for watching and don't forget to like the video if you learned something today.
Have a good one.