Ollama on VPS: Run Language Models on Your Own Server

11.05.2026 18:36

Ollama turns running large language models from a DevOps project into a few terminal commands. Pull a model, run it — Llama, Mistral, Gemma, DeepSeek, Qwen, Phi all work locally on your own hardware, no external APIs, no data leaving your server.

On a personal machine, Ollama is fine for tinkering. On a VPS it becomes something more useful: a persistent API for your applications, a private assistant your whole team can reach, a RAG pipeline over internal documents, automation that doesn't depend on OpenAI's uptime or pricing changes.

This guide covers installing Ollama on THE.Hosting VPS, choosing the right plan for your target model size, exposing the API safely, and adding a web interface.

Why a VPS and Not Your Local Machine

A laptop with Ollama running goes to sleep. Colleagues can't reach your local IP. The moment your machine locks, the API is gone.

A VPS runs 24/7, has a stable public IP, and doesn't care about your home internet. For team use or anything embedded in an application, it's the only option that makes sense.

There's also a privacy argument. Prompts, documents, and responses stay on your server — they never touch a third-party API. For legal, medical, or financial content that can't leave your control, running locally isn't optional.

How Much RAM You Actually Need

RAM is the binding constraint. The model has to fit entirely into memory. If it doesn't, weights get paged between RAM and disk and generation speed drops by 5–10x. That's the difference between a usable tool and something you'll give up on.

Approximate requirements by model size (quantized Q4 versions):

Model                         Parameters   Minimum RAM   Comfortable RAM
Llama 3.2, Phi-3 mini         1–3B         4 GB          8 GB
Llama 3.1, Mistral, Gemma 3   7–8B         8 GB          16 GB
Llama 3.1, Gemma 3            13B          16 GB         32 GB
Qwen 2.5, DeepSeek R1         32B          32 GB         64 GB
Llama 3.3, DeepSeek R1        70B          64 GB         128 GB

Full-precision (FP16) models need 2–3× more RAM than these figures.
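
The arithmetic behind these figures is straightforward: Q4 quantization stores roughly half a byte per parameter, so an 8B model needs about 8 × 0.5 = 4 GB for the weights alone, plus a few gigabytes for the KV cache, the OS, and Ollama itself, which is where the 8 GB minimum in the table comes from. FP16 stores 2 bytes per parameter, hence the 2–3× multiplier.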

CPU: for serving multiple users concurrently, 4–8 vCPU is the practical minimum. Single-user or testing workloads run fine on 2 vCPU.

GPU: Ollama works without one — pure CPU inference is supported. But speed on CPU tops out at 3–8 tokens/sec for a 7B model, which is usable but noticeably slow. An NVIDIA GPU (compute capability 5.0 or newer) pushes that to 30–60 tokens/sec for the same model. If response speed matters, a dedicated server with a GPU is the right call.

VPS tier recommendations at THE.Hosting:

  • Experiments with 3B models — 2 vCPU / 4 GB RAM
  • 7B models for personal use — 4 vCPU / 8 GB RAM
  • 7–13B models for team use — 4–8 vCPU / 16–32 GB RAM
  • 30–70B models — Dedicated server with 64 GB+ RAM

VPS plans start from €5.77/month. After KYC verification, a trial tier is available at €1/month for up to 6 months.

Installing Ollama

Connect to the server:

ssh root@your-IP-address

Update the system:

apt update && apt upgrade -y

Run the official install script — it detects your architecture and GPU automatically:

curl -fsSL https://ollama.com/install.sh | sh

Verify the service is running:

systemctl status ollama

active (running) means you're good. Ollama is now listening on 127.0.0.1:11434.
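
You can also confirm the API answers locally; the root endpoint returns a plain status string:

curl http://127.0.0.1:11434

Expected output: Ollama is running.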

Pulling and Running Models

Download a model from the official library at ollama.com/library:

ollama pull llama3.2:3b

Start an interactive chat session:

ollama run llama3.2:3b

Type your prompt and the model responds in the terminal. To exit, type /bye.
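
A few other in-session commands worth knowing: /? lists them all, /show info prints the loaded model's parameters and quantization, and /set parameter temperature 0.5 adjusts sampling on the fly.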

Good models to start with:

  • General use and quick experiments: llama3.2:3b or llama3.2:1b, fast and low-demand
  • Coding: qwen2.5-coder:7b or deepseek-coder-v2:16b
  • Structured reasoning: deepseek-r1:8b
  • Multilingual work: qwen2.5:7b handles non-Latin scripts better than most

List downloaded models:

ollama list

Remove a model to free up disk space:

ollama rm llama3.2:3b
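
If you find yourself repeating the same system prompt, Ollama can bake it into a custom model variant via a Modelfile. A minimal sketch; the name support-bot and the prompt text here are placeholders, not anything prescribed:

# Base the variant on an already-pulled model
FROM llama3.2:3b
# System prompt applied to every conversation with this variant
SYSTEM "You are a concise technical support assistant. Answer in three sentences or fewer."
# Lower temperature for more deterministic answers
PARAMETER temperature 0.3

Save it as Modelfile, then build and run:

ollama create support-bot -f Modelfile
ollama run support-bot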

Exposing the API for External Access

By default Ollama only accepts connections from localhost. To reach it from an application or another server, you need to change that.

Open the service override file:

systemctl edit ollama

Add these lines:

[Service]
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_ORIGINS=*"

Reload and restart:

systemctl daemon-reload
systemctl restart ollama

The API is now reachable on port 11434. Test it from another machine:

curl http://your-IP:11434/api/generate \
  -d '{"model": "llama3.2:3b", "prompt": "Hello!", "stream": false}'

Security note: never leave port 11434 open to the public internet. Ollama has no built-in authentication. Lock it down via firewall and put Nginx in front:

ufw allow from your-trusted-IP to any port 11434
ufw deny 11434

Nginx Reverse Proxy with Basic Auth

For HTTPS access with password protection:

Install Nginx and the password utility:

apt install nginx apache2-utils -y

Create a password file:

htpasswd -c /etc/nginx/.htpasswd your_username

Create the virtual host config:

nano /etc/nginx/sites-available/ollama

Content (a plain HTTP block for now; the certificate files don't exist yet, and certbot will add the TLS directives in the next step):

server {
    listen 80;
    server_name ollama.your-domain.com;

    auth_basic "Ollama API";
    auth_basic_user_file /etc/nginx/.htpasswd;

    location / {
        proxy_pass http://127.0.0.1:11434;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        # Generations can stream for minutes; don't buffer or cut them off early
        proxy_buffering off;
        proxy_read_timeout 300s;
    }
}

Enable the site, then obtain a certificate; certbot rewrites the config for HTTPS and adds an HTTP-to-HTTPS redirect:

ln -s /etc/nginx/sites-available/ollama /etc/nginx/sites-enabled/
nginx -t && systemctl reload nginx
apt install certbot python3-certbot-nginx -y
certbot --nginx -d ollama.your-domain.com
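
Verify the protected endpoint from any machine, substituting the credentials you created with htpasswd:

curl -u your_username:your_password https://ollama.your-domain.com/api/tags

A JSON list of your installed models means TLS and Basic Auth are both working.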

Adding a Web Interface: Open WebUI

Open WebUI is a browser-based chat interface for Ollama. It looks and feels like ChatGPT, runs entirely on your server, and connects to your local models.

Install Docker:

curl -fsSL https://get.docker.com | sh

Start Open WebUI:

docker run -d \
  --network=host \
  -e OLLAMA_BASE_URL=http://127.0.0.1:11434 \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

With host networking, Open WebUI listens on its internal default port 8080. Wrap it in Nginx with HTTPS using the same approach as above, and you have a private ChatGPT-style interface on your own domain.
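
One caveat for that Nginx config: Open WebUI streams chat over WebSockets, so the location block needs the upgrade headers. A sketch, assuming the same server block layout as in the Ollama example above:

location / {
    proxy_pass http://127.0.0.1:8080;
    proxy_set_header Host $host;
    # WebSocket upgrade headers, required for streaming chat
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
}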

Choosing a Server Location

Pick a location close to the people who will be using the API — latency matters when you're waiting on inference responses.

For European teams or GDPR-sensitive workloads: Germany (Frankfurt) keeps data within the EU. Netherlands (Meppel) has excellent pan-European connectivity.

For CIS teams: Finland (Helsinki) or Moldova (Chișinău), the latter with dedicated server options.

For Asia-Pacific users: Japan (Tokyo) for East Asia, Hong Kong for access to China and Southeast Asia.

For North American workloads: USA (New Jersey, Secaucus).

Common Issues

Generation is extremely slow. The model doesn't fit in RAM and weights are being paged from disk. Run ollama ps: the SIZE and PROCESSOR columns show how large the loaded model is and whether it's running on CPU, GPU, or split between them. Fix: use a smaller model or upgrade to more RAM.

Failed to connect when calling the API from outside. Ollama is still bound to localhost. Add OLLAMA_HOST=0.0.0.0 to the service configuration and restart.
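
To check which address Ollama is actually bound to:

ss -tlnp | grep 11434

0.0.0.0:11434 means external access is enabled; 127.0.0.1:11434 means the override hasn't taken effect.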

Model won't download. Check available disk space with df -h. Models range from 1–2 GB for small 1–3B variants to 40+ GB for 70B models. Expand your disk or delete unused models first.
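
On a default script install, models are stored under /usr/share/ollama/.ollama/models; check their footprint with du -sh /usr/share/ollama/.ollama/models.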

Open WebUI can't reach Ollama. With --network=host, Open WebUI talks to Ollama via localhost:11434. Confirm Ollama is running and listening: curl localhost:11434.

Ready to run your own language models?

THE.Hosting VPS — KVM, Ubuntu, 50+ locations worldwide. Match the plan to your model size.

Starting from €5.77/month, trial tier at €1/month with KYC for up to 6 months. For large models, Dedicated servers with high RAM are available.

Choose a VPS for Ollama 
Dedicated Servers

FAQ:

Does Ollama work without a GPU?

Yes, CPU-only mode is fully supported. Generation speed will be lower — roughly 3–8 tokens/sec for a 7B model compared to 30–60 tokens/sec with a good NVIDIA GPU. For personal use and experimentation, CPU is fine.

Which model works best for non-English languages?

Qwen 2.5 (7B, 14B, 32B) consistently leads on non-Latin scripts including Russian, Chinese, and Arabic. Recent Llama 3.x versions are solid too. DeepSeek R1 handles multilingual prompts well. Gemma 3 is notably weaker outside English.

Can Ollama replace the OpenAI API in my code?

Yes. Ollama exposes an OpenAI-compatible endpoint at http://localhost:11434/v1/. Point any OpenAI client library, LangChain, LlamaIndex, or n8n to that URL as the base_url and use any string as the API key. Most integrations work without code changes.
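
A quick way to verify the compatibility layer with plain curl; the Bearer token can be any string:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer anything" \
  -d '{"model": "llama3.2:3b", "messages": [{"role": "user", "content": "Hello!"}]}'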

Can multiple models run at the same time?

Yes, as long as there's enough RAM for all of them. Ollama loads models on demand and automatically unloads models that haven't been used in 5 minutes.
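
Both behaviors are tunable through the same systemd override used earlier; the values below are illustrative, not defaults:

[Service]
Environment="OLLAMA_KEEP_ALIVE=30m"
Environment="OLLAMA_MAX_LOADED_MODELS=2"

OLLAMA_KEEP_ALIVE sets how long an idle model stays in memory; OLLAMA_MAX_LOADED_MODELS caps how many can be resident at once. Apply with systemctl daemon-reload && systemctl restart ollama.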

Is it safe to expose port 11434 publicly?

No — Ollama has no built-in authentication. Use Nginx with Basic Auth in front of it, restrict access by IP via firewall, or tunnel through WireGuard or SSH.
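
The SSH tunnel option needs no server-side changes at all; forward the port from your own machine:

ssh -L 11434:localhost:11434 root@your-IP

Then http://localhost:11434 on your laptop reaches the remote instance, with SSH handling both encryption and authentication.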
