Ollama for Windows
Run powerful AI models on your own PC — free, private, and offline. Every model, every plugin, every setup step. Tested live on real hardware.
Ollama is free, open-source software (MIT License) that lets you run AI models like Llama 3, Mistral, DeepSeek, and Gemma directly on your Windows PC. No subscriptions, no cloud, no data leaves your machine. A single command, ollama run llama3.2, downloads the model on first use and starts a chat with a model that rivals GPT-3.5 performance. If you have an NVIDIA GPU (like the RTX 3060 with 12GB VRAM), expect 50+ tokens per second on 7-8B models. Even without a GPU, smaller models run on the CPU alone.
Why Run AI Locally?
Complete privacy. Your prompts, your documents, your code — nothing leaves your computer. No telemetry, no tracking, no terms of service changes. This matters for sensitive business data, personal projects, and any work where confidentiality is non-negotiable.
Zero cost. No monthly subscriptions. No per-token API charges. No rate limits. No "you've reached your daily limit" messages. Once you download a model, it is yours to use forever — unlimited requests, 24/7, completely free.
No internet required. Models run entirely offline after the initial download. Work on planes, in basements, during outages. Your AI assistant never goes down for maintenance.
Full control. Customize model behavior with system prompts. Create specialized models via Modelfiles. Adjust temperature, context length, and inference parameters. No content filters you did not choose. No arbitrary restrictions on your workflow.
- Ollama is 100% free — MIT License, open-source, no subscriptions, no API costs, even for commercial use
- One-command install — download from ollama.com, run the installer, pull a model, and start chatting in under 10 minutes
- 20+ free models — including Llama 3.3, Mistral-Nemo, Gemma 3, Phi-4, DeepSeek R1, Qwen 3, Kimi K2.5, and GPT-OSS
- GPU accelerated — NVIDIA RTX 3060 (12GB VRAM) runs 7-13B models at 50-70+ tokens/second with 98% GPU utilization
- Works without a GPU — CPU-only mode runs smaller models at readable speeds on any modern quad-core processor
- Free GUI apps — Open WebUI, Ollama desktop chat, LM Studio, Lobe Chat, Page Assist browser extension
- Developer-ready — REST API at localhost:11434, Python and JavaScript libraries, Docker support, IDE integrations
- Real hardware tested — this guide includes benchmarks from an HP OMEN 25L with i5-13400F, 48GB RAM, and RTX 3060
Install Ollama on Windows — 10 Minutes
Step 1: Download the Installer
Go to ollama.com/download/windows and download OllamaSetup.exe (approximately 1.2GB). Double-click the installer to run it; it installs into your user profile by default and does not need administrator rights. The installer adds Ollama to your system tray, so look for the llama icon near your clock.
Step 2: Verify the Installation
    # Open Windows Terminal (or Command Prompt) and check the version
    ollama --version
    # You should see something like: ollama version 0.16.3

    # Check that the service is running
    ollama list
    # An empty list means Ollama is running but no models are downloaded yet
Step 3: Pull Your First Model
    # Download Llama 3.2 (Meta's latest small model, ~2GB)
    ollama pull llama3.2

    # You'll see download progress:
    pulling manifest
    pulling 8934d96d3f08... 100% |████████████████████| 2.0 GB
    verifying sha256 digest
    writing manifest
    success
Step 4: Start Chatting
    # Start an interactive chat session
    ollama run llama3.2

    # Type your question and press Enter
    >>> What is vibe coding?

    # The model responds in real time, generated on YOUR hardware
    # Type /bye to exit the chat
Step 5: Manage Your Models
    # List all downloaded models with sizes
    ollama list

    # Show model details (architecture, parameters, license)
    ollama show llama3.2

    # See which models are loaded in GPU/CPU memory
    ollama ps

    # Remove a model to free disk space
    ollama rm llama3.2

    # Update a model to the latest version
    ollama pull llama3.2

    # Launch integrated apps (new in 2026)
    ollama launch openclaw   # AI assistant
- OS: Windows 10/11 64-bit (Home, Pro, Enterprise, or Education; version 21H2+)
- RAM: 8GB minimum (16GB+ recommended)
- Disk: 12GB free for Ollama + models (SSD strongly recommended, NVMe ideal)
- GPU: Optional but recommended; NVIDIA with compute capability 5.0+ and driver 531+
- Internet: Required only for downloading models
Real-World Hardware: Running Ollama Live
This is not a theoretical guide. Ollama is running right now on the author's personal workstation — an HP OMEN 25L Gaming Desktop. Here are the exact specifications and what models it can handle.
HP OMEN 25L Gaming Desktop GT15-1xxx
Running Ollama with Mistral-Nemo — verified live February 2026
What This Hardware Can Run
With 12GB of VRAM on the RTX 3060, this system comfortably runs any model up to 13 billion parameters at full GPU speed. The 48GB of system RAM provides generous overflow capacity — when a model is too large for VRAM alone, Ollama automatically splits layers between GPU and CPU. The i5-13400F (10 cores, 16 threads at 2.5GHz base) handles CPU inference at respectable speeds for models under 8B parameters even without GPU involvement.
| Model Size | Fits in VRAM? | Speed (est.) | Example Models |
|---|---|---|---|
| 3B | Yes — fully | 80-100+ tok/s | Llama 3.2:3b, Phi-3 Mini, Gemma 2:2b |
| 7-8B | Yes — fully | 50-70+ tok/s | Llama 3.1:8b, Mistral 7B, Gemma 2:9b |
| 12-13B | Yes — tight fit | 30-45 tok/s | Mistral-Nemo:12b, CodeLlama:13b |
| 14B | Partial — GPU+CPU split | 15-25 tok/s | Phi-4, Qwen 2.5:14b |
| 30-34B | Mostly CPU (48GB RAM helps) | 5-10 tok/s | CodeLlama:34b, Yi:34b |
| 70B+ | CPU only — very slow | 1-3 tok/s | Llama 3.3:70b (possible but slow) |
The RTX 3060 12GB is ideal for 7-13B parameter models in Q4_K_M quantization (Ollama's default). This is where you get the best balance of quality and speed. Models like Mistral-Nemo 12B, Llama 3.1 8B, and Gemma 2 9B deliver fast, high-quality responses fully accelerated on the GPU. The 48GB system RAM is a major bonus — it lets you run larger models via GPU+CPU split when you need extra capability.
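The speed estimates above can be checked on your own hardware. The following is a minimal sketch, assuming the non-streaming /api/generate response carries the eval_count (generated tokens) and eval_duration (nanoseconds) fields described in Ollama's API documentation. The helper names tokens_per_second and benchmark are illustrative, not part of any library.

```python
import json
import urllib.request


def tokens_per_second(response: dict) -> float:
    """Compute generation speed from a final (done) Ollama response object."""
    eval_count = response["eval_count"]            # number of tokens generated
    eval_duration_ns = response["eval_duration"]   # generation time in nanoseconds
    return eval_count / eval_duration_ns * 1e9


def benchmark(model: str, prompt: str, host: str = "http://localhost:11434") -> float:
    """Run one non-streaming generation and return tokens per second.

    Needs a running Ollama instance with the model already pulled.
    """
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return tokens_per_second(json.load(resp))


# Example (requires Ollama to be running):
# print(f"{benchmark('llama3.2', 'Explain VRAM in one paragraph.'):.1f} tok/s")
```

The first request to a model includes load time, which is reported separately from eval_duration, so the figure reflects generation speed only. Running the same prompt a few times gives a steadier number.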
Free Models — General Purpose
Every model below is free to download, free to use, and free for commercial work. Install any model with a single command: ollama pull model-name. Browse the full library at ollama.com/library.
Meta's flagship open model. GPT-4-class performance in the 70B variant, with excellent instruction following, reasoning, and multilingual support. The 8B version runs smoothly on consumer GPUs.
12B parameter model that fits perfectly in 12GB VRAM. Excellent for fast responses, translation, and text summarization. One of the best quality-per-VRAM models available. Currently running on the author's OMEN 25L.
Google's latest open model family. The 27B variant offers strong reasoning at a compact size. The 4B is excellent for resource-constrained setups. Supports 140+ languages with built-in vision capability.
Microsoft's state-of-the-art small model. 14B parameters with reasoning performance that punches well above its weight class. Excels at math, logic, science, and structured tasks.
Alibaba's latest generation spanning 0.6B to 235B parameters with dense and MoE architectures. Supports 201 languages and 128K context. The 8B variant is an excellent all-rounder for consumer GPUs.
Deep reasoning model with chain-of-thought capabilities. Shows its thinking process step by step. Strong at math, logic, and complex analysis. Distilled versions available in 1.5B to 70B sizes.
1T total parameters (32B active) via Mixture-of-Experts. The strongest open-source coding model with visual-to-code generation. Agent Swarm mode for parallel task execution. Excels at front-end development.
OpenAI's first open-weight model since GPT-2. Available in 120B and 20B variants under Apache 2.0 license. The 20B version runs locally on consumer hardware with 4-bit quantization.
Top-ranked on Quality Index (49.64), 203K context, 77.8% SWE-bench Verified. Excellent for agent execution, long coding tasks, and reliable daily development assistance. Open license.
State-of-the-art model designed for real-world productivity and coding tasks. One of the newest additions to the Ollama library, optimized for practical everyday workflows.
Free Models — Coding Specialists
These models are specifically trained or fine-tuned for code generation, debugging, refactoring, and software engineering tasks. Perfect for vibe coding without an internet connection.
Meta's dedicated code model. Supports code generation, completion, infilling, and instruction-following across many programming languages. The 13B variant fits on an RTX 3060 and handles most coding tasks well.
Coding-focused model from Alibaba's Qwen team, optimized for agentic coding workflows and local development. Among the newest additions to Ollama's library in February 2026.
Purpose-built for code with strong multi-language support across Python, JavaScript, TypeScript, Java, C++, and more. Excellent at understanding existing codebases and generating contextually aware solutions.
Transparently trained open code model available in 3B, 7B, and 15B sizes. Trained on The Stack v2 dataset with full data transparency. Strong at code completion and fill-in-the-middle tasks.
Free Models — Vision & Multimodal
These models can understand images alongside text — describe photos, read documents, analyze charts, and convert screenshots to code.
The pioneering open-source multimodal model. Combines a vision encoder with language understanding for general-purpose visual + text tasks. Great for image description, visual Q&A, and document analysis.
Meta's multimodal models in 11B and 90B sizes. Instruction-tuned for image reasoning tasks including chart reading, document understanding, visual question answering, and image captioning.
Specialized multimodal OCR model for complex document understanding. Built on the GLM-V encoder-decoder architecture. Excellent for extracting text from scanned documents, receipts, and handwriting.
Free GUI Apps & Plugins for Ollama
Ollama runs great from the terminal, but these free tools give you visual interfaces ranging from ChatGPT-like web apps to browser extensions and IDE integrations. All work with your locally running Ollama instance.
Desktop & Web Interfaces
Ollama Desktop Chat
Ollama now ships with a built-in desktop chat interface — no separate installation needed. Launch it from the system tray icon. Clean, minimal interface with model switching, conversation history, and settings. The easiest way to get started.
Included with Ollama
Open WebUI
The most popular and feature-rich Ollama GUI. ChatGPT-like web interface with RAG (upload documents for context), web search, image generation (DALL-E, ComfyUI), multi-model conversations, custom model builder, and RBAC for teams. Requires Docker.
github.com/open-webui
LM Studio
Polished desktop app for discovering, downloading, and running local models. Beautiful model catalog with search and filtering. Friendly chat interface with conversation management. Works alongside Ollama or standalone. Windows, macOS, Linux.
lmstudio.ai
Lobe Chat
Privacy-focused ChatGPT-like UI framework. Sleek interface with voice conversations, text-to-image generation, and plugin support. Deploy locally via Docker or one-click on Vercel. Progressive Web App support for mobile access.
github.com/lobehub
Askimo
Native desktop AI workspace with Ollama integration. Features RAG for project files, CLI automation, and multi-model support. Built as a true desktop app (not web-based) for fast, responsive local AI work. Windows, macOS, Linux.
askimo.chat
Msty
Cross-platform local-first UI with conversational branches and Obsidian vault integration for knowledge stacks. Lets you organize AI conversations by project and branch off into different directions from any point.
msty.app
Browser Extensions & IDE Integrations
Page Assist
Open-source browser extension for running Ollama models directly in Chrome or Firefox. Manage models, upload files, enable web search — all from a sidebar in your browser. No separate app needed.
GitHub: page-assist
Continue
Open-source AI code assistant for VS Code and JetBrains IDEs. Connect it to your local Ollama instance for private, offline code assistance. Tab completion, chat, and inline editing — all powered by your local models. 20K+ GitHub stars.
github.com/continuedev
Cline
VS Code extension for autonomous multi-file and whole-repo coding. Features Plan and Act modes — plan your changes first, then execute. Supports Ollama as a backend for fully local, private AI-assisted development.
github.com/cline
OpenClaw
Ollama's integrated personal AI assistant. Automates work, answers questions, handles tasks — connects to WhatsApp, Telegram, Slack, and Discord. Install with one command: ollama launch openclaw.
Developer Tools
Ollama REST API
Every Ollama install exposes a REST API at localhost:11434. Use it to integrate local AI into any application — web apps, scripts, automation workflows. Full chat, generate, embed, and model management endpoints.
Python & JavaScript Libraries
Official client libraries for Python (pip install ollama) and JavaScript (npm install ollama). Build local AI applications with clean, typed APIs. Full streaming support for real-time token output.
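To show what streaming looks like underneath the client libraries, here is a minimal standard-library sketch. It assumes that /api/chat with "stream": true returns newline-delimited JSON chunks, each carrying a message.content fragment and a done flag, as described in Ollama's API documentation. The function names iter_stream_content and stream_chat are illustrative.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_stream_content(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from a newline-delimited JSON stream of chat chunks."""
    for raw in lines:
        if not raw.strip():
            continue  # ignore blank lines
        chunk = json.loads(raw)
        piece = chunk.get("message", {}).get("content", "")
        if piece:
            yield piece
        if chunk.get("done"):
            break  # the final chunk marks the end of the reply


def stream_chat(model: str, prompt: str, host: str = "http://localhost:11434") -> Iterator[str]:
    """Send one user message and yield the reply as it is generated."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode()
    request = urllib.request.Request(
        f"{host}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        # The HTTP response object can be iterated line by line.
        yield from iter_stream_content(resp)


# Example (requires Ollama to be running):
# for piece in stream_chat("mistral-nemo", "Write a haiku about VRAM."):
#     print(piece, end="", flush=True)
```

Keeping the parsing in its own function means it can be exercised with canned bytes, without a model loaded.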
    # Install Open WebUI with Docker (GPU support)
    # In PowerShell, replace each trailing \ with a backtick (`) or put the command on one line
    docker run -d -p 3000:8080 --gpus=all \
      -v ollama:/root/.ollama \
      -v open-webui:/app/backend/data \
      --name open-webui --restart always \
      ghcr.io/open-webui/open-webui:ollama

    # CPU-only version (no --gpus flag); use this instead of the command above, not in addition to it
    docker run -d -p 3000:8080 \
      -v ollama:/root/.ollama \
      -v open-webui:/app/backend/data \
      --name open-webui --restart always \
      ghcr.io/open-webui/open-webui:ollama

    # Access at http://localhost:3000
VRAM & Hardware Requirements Guide
The golden rule of local AI: VRAM is king. The more GPU memory you have, the larger and faster the models you can run. But you do not need top-of-the-line hardware — here is exactly what each tier can handle.
VRAM Requirements by Model Size
Ollama uses Q4_K_M quantization by default, which compresses models to roughly 25% of their full-precision size. A good rule of thumb: multiply the quantized model file size by 1.2x to account for the KV cache (context window memory).
| Model Size | Download Size | VRAM Needed | RAM Needed (CPU) | Best GPU Match |
|---|---|---|---|---|
| 1-3B | 0.7 – 2 GB | 2 – 4 GB | 8 GB | Any GPU / CPU-only |
| 7-8B | 4 – 5 GB | 6 – 8 GB | 16 GB | RTX 3060 (12GB), RTX 4060 (8GB) |
| 12-13B | 7 – 8 GB | 9 – 12 GB | 16 – 32 GB | RTX 3060 (12GB), this guide's GPU |
| 14B | 8 – 9 GB | 10 – 14 GB | 32 GB | RTX 4080 (16GB), RTX 3090 (24GB) |
| 30-34B | 18 – 20 GB | 20 – 24 GB | 32 – 64 GB | RTX 3090/4090 (24GB) |
| 70B | 38 – 42 GB | 40 – 48 GB | 64 GB+ | 2x RTX 3090 or A100 (40GB) |
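The 1.2x rule of thumb from this section can be written as a small calculation. This is a sketch of that rule only: the function names are illustrative, and the 1.2 factor is the guide's estimate, so long context windows can need more.

```python
def estimate_vram_gb(model_file_gb: float, overhead: float = 1.2) -> float:
    """Estimate VRAM needed: quantized file size times a KV cache overhead factor."""
    return model_file_gb * overhead


def fits_in_vram(model_file_gb: float, vram_gb: float, overhead: float = 1.2) -> bool:
    """Return True if the estimated footprint fits entirely in the given VRAM."""
    return estimate_vram_gb(model_file_gb, overhead) <= vram_gb


# A 7.5 GB download (12-13B class) on a 12 GB card:
print(fits_in_vram(7.5, 12))    # estimated footprint is about 9 GB
# A 19 GB download (30-34B class) on the same card:
print(fits_in_vram(19.0, 12))   # estimated footprint is about 22.8 GB
```

When the check fails, Ollama can still run the model by splitting layers between GPU and CPU, at the reduced speeds shown in the earlier table.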
Hardware Tiers for Ollama
Budget Build — $0 Extra (CPU-Only)
Any modern quad-core CPU with 8-16GB RAM. Runs 3B-8B models at readable speeds (5-15 tokens/second). Perfect for testing and learning. No GPU purchase needed — just install Ollama and go.
Mid-Range Build — RTX 3060 / 4060
The sweet spot for most vibe coders. An RTX 3060 (12GB VRAM) or RTX 4060 Ti (16GB) with 16-48GB RAM gives you fast inference on 7-13B models with full GPU acceleration. This is the setup behind this guide. Models like Mistral-Nemo 12B, Llama 3.1 8B, and Gemma 3 run at 50-70+ tokens per second.
High-End Build — RTX 4090 / 3090
24GB VRAM opens up 30-34B models at full GPU speed and 70B models via GPU+CPU split. With 64GB+ RAM, you can run essentially any open model. Two GPUs double your VRAM — Ollama supports multi-GPU automatically.
SSD is critical: model loading from an NVMe SSD takes seconds; from an HDD it takes minutes. Close GPU-hungry apps before running large models, because games, video editors, and browsers with GPU acceleration eat into your available VRAM. KV cache quantization (set OLLAMA_KV_CACHE_TYPE=q8_0, which also requires flash attention via OLLAMA_FLASH_ATTENTION=1) can cut context window memory usage roughly in half, letting you fit larger contexts on smaller GPUs. Disk space: plan for 2x the model size in free space during download.
Pro Tips for Ollama Power Users
1. Create Custom Models with Modelfiles
A Modelfile lets you create specialized AI assistants by combining a base model with custom system prompts, parameters, and behavior. Save your configuration once and load it by name forever.
    # Save this as "Modelfile" (no extension)
    FROM mistral-nemo

    SYSTEM """You are a senior Shopify developer specializing in
    Liquid templates, custom sections, and theme development.
    Always write production-ready code with proper error handling.
    Use vanilla JS only, no jQuery.
    Scope all CSS under unique wrapper classes."""

    PARAMETER temperature 0.3
    PARAMETER num_ctx 8192

    # Create and run your custom model:
    # ollama create shopify-dev -f Modelfile
    # ollama run shopify-dev
2. Use the API for Automation
Every Ollama install runs a local API server. Use it to integrate AI into your scripts, apps, and workflows — no external dependencies, no API keys, no rate limits.
    from ollama import chat

    response = chat(model='mistral-nemo', messages=[
        {'role': 'user', 'content': 'Write a Shopify section schema for a product grid'}
    ])
    print(response.message.content)
3. Keep Multiple Models for Different Tasks
No single model does everything best. Keep 2-3 models installed and switch between them. Use Mistral-Nemo for fast general tasks, CodeLlama for programming, DeepSeek R1 for complex reasoning, and LLaVA when you need image understanding. Switch instantly with ollama run model-name.
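In a script, the same idea becomes a small lookup from task type to model name. This is a hypothetical sketch: the task labels and the pick_model helper are invented for illustration, and the model tags assume you have pulled those models from the Ollama library.

```python
# Map each kind of task to the model this guide suggests for it.
TASK_MODELS = {
    "general": "mistral-nemo",
    "code": "codellama",
    "reasoning": "deepseek-r1",
    "vision": "llava",
}


def pick_model(task: str, default: str = "mistral-nemo") -> str:
    """Return the model name for a task, falling back to a general model."""
    return TASK_MODELS.get(task.strip().lower(), default)


print(pick_model("code"))       # codellama
print(pick_model("unknown"))    # mistral-nemo
```

The returned name can be passed as the model field of any API request, so one script can send different jobs to different models.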
4. Set Environment Variables for Performance
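    # Set in System Properties → Environment Variables (persistent),
    # or run in PowerShell for the current session only, then start
    # Ollama from that same session (quit the tray app first).

    # Reduce KV cache memory (fit more context in less VRAM);
    # a quantized KV cache requires flash attention to be enabled
    $env:OLLAMA_FLASH_ATTENTION = "1"
    $env:OLLAMA_KV_CACHE_TYPE = "q8_0"

    # Change model storage location (useful if C: is small)
    $env:OLLAMA_MODELS = "D:\ollama\models"

    # Force CPU-only mode (if GPU causes issues) by hiding all NVIDIA GPUs
    $env:CUDA_VISIBLE_DEVICES = "-1"

    # Use specific GPUs in multi-GPU setups
    $env:CUDA_VISIBLE_DEVICES = "0,1"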
5. Ollama + Docker = Production Ready
For serious deployments, run Ollama inside Docker. This isolates the environment, makes it easy to update, and pairs perfectly with Open WebUI for a polished user experience. Docker Desktop for Windows includes GPU passthrough support for NVIDIA GPUs.
6. Free Your Storage — Manage Disk Space
Models are large files. A single 13B model is about 7-8GB. Regularly check your installed models with ollama list and remove unused ones with ollama rm model-name. Move your model storage to a different drive by setting the OLLAMA_MODELS environment variable to a path on your largest drive.
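The same check can be scripted against the local API. The sketch below assumes the /api/tags response has a models list whose entries include name and size (in bytes), as in Ollama's API documentation. The helper names are illustrative.

```python
import json
import urllib.request


def summarize_models(tags_payload: dict) -> list[tuple[str, float]]:
    """Return (model name, size in GB) pairs, largest first."""
    rows = [(m["name"], m["size"] / 1e9) for m in tags_payload.get("models", [])]
    return sorted(rows, key=lambda row: row[1], reverse=True)


def fetch_tags(host: str = "http://localhost:11434") -> dict:
    """Fetch the list of installed models from a running Ollama instance."""
    with urllib.request.urlopen(f"{host}/api/tags") as resp:
        return json.load(resp)


# Example (requires Ollama to be running):
# rows = summarize_models(fetch_tags())
# for name, size_gb in rows:
#     print(f"{name:30s} {size_gb:6.1f} GB")
# print(f"Total: {sum(size for _, size in rows):.1f} GB")
```

Sorting by size puts the best candidates for ollama rm at the top of the list.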
Use Ollama as a Local AI Engine for Your Own Apps
One of Ollama's most powerful and underrated features is that every installation runs a fully functional API server on your machine at http://localhost:11434. This means any app, script, or system you build can call Ollama the same way it would call the OpenAI or Anthropic API — except it is free, private, and runs entirely on your hardware. No API keys. No rate limits. No per-token billing.
The author's Synthetic Director — a 13-platform social media content generation system — uses this exact architecture. The system calls a locally running Ollama instance to generate content drafts, analyze trends, and enforce brand guidelines via the REST API. By pointing the app at localhost:11434 instead of a cloud API, the entire content pipeline runs with zero API costs, zero data exposure, and zero dependency on external services being online. Any AI-powered application you build can do the same.
How It Works: The Architecture
When Ollama starts (it auto-launches on Windows boot via the system tray), it spins up a local HTTP server. Any application on your machine (a Python script, a Node.js app, a React frontend, a Shopify automation tool, a custom AGI pipeline) can send HTTP requests to this server and receive AI-generated responses. Alongside its native /api endpoints, Ollama serves an OpenAI-compatible API under /v1, so many tools that work with OpenAI's API can be pointed at Ollama by changing the base URL.
    ┌─────────────────────────────────────────────────────────┐
    │                   YOUR APPS & SYSTEMS                   │
    │                                                         │
    │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐   │
    │  │ Synthetic    │  │ Custom       │  │ VS Code +    │   │
    │  │ Director     │  │ Python/Node  │  │ Continue     │   │
    │  │ v10.0        │  │ Scripts      │  │ Extension    │   │
    │  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘   │
    │         │                 │                 │           │
    │         ▼                 ▼                 ▼           │
    │  ┌─────────────────────────────────────────────────┐    │
    │  │ http://localhost:11434/api/chat                 │    │
    │  │ Ollama REST API (always running)                │    │
    │  └──────────────────────┬──────────────────────────┘    │
    │                         │                               │
    │                         ▼                               │
    │  ┌─────────────────────────────────────────────────┐    │
    │  │ LOCAL MODELS (Mistral-Nemo / Llama / Gemma)     │    │
    │  │ Running on YOUR GPU (RTX 3060) + YOUR RAM       │    │
    │  └─────────────────────────────────────────────────┘    │
    │                                                         │
    │  🔒 Everything stays on your machine. Zero cloud calls. │
    └─────────────────────────────────────────────────────────┘
Step-by-Step: Connect Your App to Ollama
1. Verify Ollama Is Running
    # Check if Ollama's API server is responding
    curl http://localhost:11434
    # Should return: "Ollama is running"

    # Or in PowerShell:
    Invoke-WebRequest -Uri http://localhost:11434 | Select-Object -ExpandProperty Content

    # List available models via API
    curl http://localhost:11434/api/tags
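An app can run the same check at startup before it sends any prompts. This is a minimal sketch using only the standard library. It relies on the root URL returning the text "Ollama is running" as shown in this step, and the function name is illustrative.

```python
import urllib.request


def is_ollama_running(host: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an Ollama server answers at the given address."""
    try:
        with urllib.request.urlopen(host, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except OSError:
        # Covers refused connections, timeouts, and HTTP errors.
        return False
    return "Ollama is running" in body


# Example:
# if not is_ollama_running():
#     raise SystemExit("Start Ollama from the system tray, then try again.")
```

Failing fast with a clear message is friendlier than letting the first chat request die with a connection error.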
2. Call the Chat API from Your App
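The /api/chat endpoint takes a messages array in the same role/content format as OpenAI's Chat Completions API. Send a JSON body with your model name, the messages array, and optional parameters. The response shape is Ollama's own: the reply text is in message.content.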
    curl http://localhost:11434/api/chat -d '{
      "model": "mistral-nemo",
      "messages": [
        {
          "role": "system",
          "content": "You are a Shopify content writer for a sustainable fashion brand."
        },
        {
          "role": "user",
          "content": "Write an Instagram caption for our new organic cotton hoodie."
        }
      ],
      "stream": false
    }'
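The optional parameters go in an options object inside the same JSON body. The sketch below builds such a body in Python, assuming the temperature and num_ctx option names used elsewhere in this guide's Modelfile example. The build_chat_payload helper is illustrative, not part of any library.

```python
import json
from typing import Optional


def build_chat_payload(
    model: str,
    user_prompt: str,
    system_prompt: Optional[str] = None,
    temperature: Optional[float] = None,
    num_ctx: Optional[int] = None,
    stream: bool = False,
) -> dict:
    """Build a JSON-serializable request body for POST /api/chat."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})

    payload = {"model": model, "messages": messages, "stream": stream}

    # Only include options that were actually set, so server defaults apply otherwise.
    options = {}
    if temperature is not None:
        options["temperature"] = temperature
    if num_ctx is not None:
        options["num_ctx"] = num_ctx
    if options:
        payload["options"] = options
    return payload


body = build_chat_payload(
    "mistral-nemo",
    "Write an Instagram caption for our new organic cotton hoodie.",
    system_prompt="You are a Shopify content writer for a sustainable fashion brand.",
    temperature=0.3,
)
print(json.dumps(body, indent=2))
```

Building the body in code rather than by hand also avoids the shell quoting differences between Command Prompt, PowerShell, and bash.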
3. Python Integration
Use the official ollama Python library for the cleanest integration — or call the REST API directly with requests if you prefer no dependencies.
    # pip install ollama
    from ollama import chat

    # Simple chat: works much like calling a cloud API
    response = chat(
        model='mistral-nemo',
        messages=[
            {'role': 'system', 'content': 'You are a senior developer.'},
            {'role': 'user', 'content': 'Review this code for bugs and security issues.'}
        ]
    )
    print(response.message.content)
    # pip install requests
    import requests

    # Call the native REST API directly, without the ollama client library
    response = requests.post(
        'http://localhost:11434/api/chat',
        json={
            'model': 'mistral-nemo',
            'messages': [
                {'role': 'user', 'content': 'Generate 5 product descriptions.'}
            ],
            'stream': False
        }
    )
    response.raise_for_status()  # surface HTTP errors such as an unknown model
    result = response.json()
    print(result['message']['content'])
4. JavaScript / Node.js Integration
    // npm install ollama
    import ollama from 'ollama';

    const response = await ollama.chat({
      model: 'mistral-nemo',
      messages: [
        { role: 'system', content: 'You are an AI content strategist.' },
        { role: 'user', content: 'Plan a week of social media posts.' }
      ]
    });
    console.log(response.message.content);
    // Works in Node.js 18+, Deno, Bun, or any modern runtime
    const response = await fetch('http://localhost:11434/api/chat', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        model: 'mistral-nemo',
        messages: [{ role: 'user', content: 'Your prompt here' }],
        stream: false
      })
    });
    const data = await response.json();
    console.log(data.message.content);
5. Use Ollama as an OpenAI Drop-In Replacement
Ollama's API is compatible with OpenAI's Chat Completions format. Many apps, libraries, and frameworks that use OpenAI can be redirected to your local Ollama instance by changing just the base URL. This is the fastest way to integrate local AI into existing projects.
    # pip install openai
    from openai import OpenAI

    # Point the OpenAI client at your local Ollama
    client = OpenAI(
        base_url='http://localhost:11434/v1',
        api_key='ollama'  # Required by the library, but Ollama ignores it
    )

    response = client.chat.completions.create(
        model='mistral-nemo',
        messages=[
            {'role': 'system', 'content': 'You are a helpful assistant.'},
            {'role': 'user', 'content': 'Explain how Ollama works.'}
        ]
    )
    print(response.choices[0].message.content)

    # That's it. Same OpenAI library, same code pattern.
    # Just change the base_url to localhost:11434/v1
    # Works with LangChain, LlamaIndex, CrewAI, and more.
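One practical difference between the two routes is where the reply text sits in the JSON. The native /api/chat response puts it under message.content, while the OpenAI-compatible endpoint wraps it in a choices list. A small helper can accept either shape. This is an illustrative sketch (extract_content is not a library function), and the sample dictionaries are trimmed to the relevant fields.

```python
def extract_content(response: dict) -> str:
    """Return the assistant's reply text from either response shape."""
    if "choices" in response:
        # OpenAI-compatible shape from /v1/chat/completions
        return response["choices"][0]["message"]["content"]
    # Native Ollama shape from /api/chat
    return response["message"]["content"]


native = {"message": {"role": "assistant", "content": "Hello from /api/chat"}}
openai_style = {"choices": [{"message": {"role": "assistant", "content": "Hello from /v1"}}]}

print(extract_content(native))
print(extract_content(openai_style))
```

A helper like this lets an app switch between the native endpoint and the /v1 endpoint without touching the code that consumes the reply.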
Key API Endpoints Reference
| Endpoint | Method | Purpose |
|---|---|---|
| /api/chat | POST | Send chat messages and get AI responses (supports streaming) |
| /api/generate | POST | Single-turn text generation (no message history) |
| /api/embed | POST | Generate vector embeddings for RAG and semantic search |
| /api/tags | GET | List all locally installed models |
| /api/show | POST | Get model details (architecture, parameters, license) |
| /api/pull | POST | Download a model from the Ollama library |
| /api/delete | DELETE | Remove a locally installed model |
| /v1/chat/completions | POST | OpenAI-compatible endpoint (drop-in replacement) |
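As an illustration of the embeddings endpoint, the sketch below requests vectors from /api/embed and compares them with cosine similarity, the usual building block for semantic search. It assumes the request takes model and input fields and the response returns an embeddings list, as in Ollama's API documentation. The model name nomic-embed-text is an assumption: substitute any embedding model you have pulled.

```python
import json
import math
import urllib.request


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def embed(texts: list[str], model: str = "nomic-embed-text",
          host: str = "http://localhost:11434") -> list[list[float]]:
    """Return one embedding vector per input text from a running Ollama instance."""
    body = json.dumps({"model": model, "input": texts}).encode()
    request = urllib.request.Request(
        f"{host}/api/embed",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)["embeddings"]


# Example (requires Ollama to be running with an embedding model pulled):
# query, doc_a, doc_b = embed(["return policy", "How to send an item back", "GPU benchmarks"])
# print(cosine_similarity(query, doc_a), cosine_similarity(query, doc_b))
```

Scores closer to 1.0 mean the texts are semantically closer, so ranking documents by this score against a query gives a basic local search.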
The local API unlocks unlimited possibilities. Build AI content generators like The Synthetic Director. Create Shopify automation tools that write product descriptions. Build customer support bots. Create code review pipelines. Generate SEO content at scale. Feed screenshots to vision models for automated QA. The same API that powers professional AI systems costs you $0 per month when running through Ollama on your own hardware. The only limit is your imagination and your VRAM.
Ollama turns any Windows PC into a private AI workstation. With an RTX 3060 and 12GB of VRAM, you can run models that rival ChatGPT's GPT-3.5 performance — completely free, completely private, completely offline. The ecosystem of free GUI apps, IDE integrations, and developer tools means you are not limited to a terminal. Whether you are a vibe coder, a developer building AI features, or someone who just wants a private AI assistant, Ollama is the foundation. Install it in 10 minutes, pull your first model, and start building.
