Ollama Docker Setup Guide
Overview
This guide explains how to set up Ollama with Llama 3.2 in Docker for use with the AI routing service.
Quick Start
1. Start Ollama Service
# Start Ollama container
docker-compose up -d ollama
# Wait for Ollama to be ready (about 60 seconds)
docker-compose logs -f ollama
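If you prefer not to watch the logs, a small polling loop (a sketch, assuming curl is installed on the host) can wait until the API answers:
# Poll the Ollama API until it responds
until curl -sf http://localhost:11434/api/tags > /dev/null; do
  echo "Waiting for Ollama..."
  sleep 5
done
echo "Ollama is ready"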
2. Pull Llama 3.2 Model
# Use the setup script (recommended)
./scripts/setup-ollama.sh
# Or manually:
docker-compose exec ollama ollama pull llama3.2
3. Configure Environment
Add to your .env file:
AI_ROUTING_ENABLED=true
AI_ROUTING_PROVIDER=ollama
OLLAMA_BASE_URL=http://localhost:11434
AI_ROUTING_MODEL=llama3.2
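Note: OLLAMA_BASE_URL=http://localhost:11434 is correct when the routing service reaches Ollama through the host. If haya-routing runs as a container on the same Compose network, it typically needs the service name instead, e.g. http://ollama:11434. A quick way to check both (the second command assumes curl is available inside the haya-routing image):
# From the host: confirm Ollama answers on the published port
curl -sf http://localhost:11434/api/tags
# From inside the routing container: confirm the ollama service name resolves
docker-compose exec haya-routing curl -sf http://ollama:11434/api/tags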
4. Restart Service
docker-compose restart haya-routing
Docker Compose Configuration
The docker-compose.yml includes an Ollama service:
ollama:
  image: ollama/ollama:latest
  ports:
    - "11434:11434"
  volumes:
    - ollama_data:/root/.ollama
  healthcheck:
    test: ["CMD-SHELL", "curl -f http://localhost:11434/api/tags || exit 1"]
    interval: 30s
    timeout: 10s
    retries: 3
    start_period: 60s
Key points:
- Port: 11434 (standard Ollama port)
- Volume: ollama_data persists models between restarts
- Health check: verifies Ollama is ready (see the check below)
- GPU support: Configured for NVIDIA GPU (optional)
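To confirm the health check is passing (a quick sketch; the container ID is looked up via Compose, so run it from the project directory):
# Show the health status Docker reports for the Ollama container
docker inspect --format '{{.State.Health.Status}}' "$(docker-compose ps -q ollama)"
# Expected output once the start period has elapsed: healthy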
GPU vs CPU Mode
GPU Mode (Recommended)
If you have an NVIDIA GPU:
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]
Requirements:
- NVIDIA GPU with CUDA support
- nvidia-docker2 installed
- Docker with GPU support
Benefits:
- Much faster inference and better overall performance
- Can handle larger models
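One way to confirm Ollama actually detected the GPU is to look for CUDA references in its startup logs (a rough check; the exact wording varies between Ollama versions):
# Look for GPU/CUDA detection messages in the Ollama logs
docker-compose logs ollama | grep -iE "cuda|gpu"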
CPU Mode
If you don't have a GPU, the service will run on CPU (slower but works):
# Uncomment these in docker-compose.yml
environment:
  - OLLAMA_NUM_PARALLEL=1
  - OLLAMA_MAX_LOADED_MODELS=1
Note: CPU mode is slower but functional for development/testing.
Available Models
Llama 3.2 (Recommended)
docker-compose exec ollama ollama pull llama3.2
- Size: ~2GB
- Best for: General purpose, good balance
- Speed: Fast (with GPU)
Other Models
# Llama 3.2 (3B - smaller, faster)
docker-compose exec ollama ollama pull llama3.2:3b
# Llama 3.2 (1B - smallest, fastest)
docker-compose exec ollama ollama pull llama3.2:1b
# Mistral (alternative)
docker-compose exec ollama ollama pull mistral
# Code Llama (for code-focused tasks)
docker-compose exec ollama ollama pull codellama
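Whichever model you pick, switching follows the same pattern: pull it, point AI_ROUTING_MODEL at it, and restart the routing service. A sketch (the sed line assumes AI_ROUTING_MODEL already exists in .env; editing the file by hand works just as well):
# Pull the alternative model
docker-compose exec ollama ollama pull llama3.2:1b
# Point the routing service at it
sed -i 's/^AI_ROUTING_MODEL=.*/AI_ROUTING_MODEL=llama3.2:1b/' .env
# Restart so the new setting is picked up
docker-compose restart haya-routing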
Verification
Check Ollama is Running
# Check container status
docker-compose ps ollama
# Check logs
docker-compose logs ollama
# Test API
curl http://localhost:11434/api/tags
List Available Models
docker-compose exec ollama ollama list
Test Model
docker-compose exec ollama ollama run llama3.2 "Hello, how are you?"
Test via API
curl http://localhost:11434/api/generate \
-d '{
"model": "llama3.2",
"prompt": "What is routing?",
"stream": false
}'
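The generate endpoint returns a JSON object; with jq installed (an optional convenience, not required by this setup) you can extract just the generated text:
# Print only the model's answer from the JSON response (requires jq)
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3.2", "prompt": "What is routing?", "stream": false}' \
  | jq -r '.response'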
Troubleshooting
Issue: Container won't start
Check logs:
docker-compose logs ollama
Common causes:
- Port 11434 already in use
- Insufficient memory
- Docker not running
Solution:
# Check if port is in use
lsof -i :11434
# Stop conflicting service or change port in docker-compose.yml
Issue: Model download fails
Solution:
# Retry download
docker-compose exec ollama ollama pull llama3.2
# Check disk space
df -h
# Check network connectivity
docker-compose exec ollama ping -c 3 8.8.8.8
Issue: Slow performance
Solutions:
- Use GPU (if available):
  # Install nvidia-docker2
  # Restart Docker
  docker-compose up -d ollama
- Use a smaller model:
  docker-compose exec ollama ollama pull llama3.2:1b
  # Update .env: AI_ROUTING_MODEL=llama3.2:1b
- Increase the timeout:
  # In .env
  AI_ROUTING_TIMEOUT_MS=30000
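To see whether a change (GPU, smaller model, higher timeout) actually helped, time a single request before and after (a rough measurement; the first request after startup also includes model load time):
# Rough end-to-end latency for one generation request
time curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3.2", "prompt": "ping", "stream": false}' > /dev/null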
Issue: Out of memory
Solutions:
- Use smaller model (llama3.2:1b or llama3.2:3b)
- Reduce parallel requests:
  environment:
    - OLLAMA_NUM_PARALLEL=1
- Increase the Docker memory limit
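To see which models are currently loaded and roughly how much memory they occupy, recent Ollama versions provide an ollama ps command (if your image is older, fall back to docker stats):
# Show loaded models and their memory footprint
docker-compose exec ollama ollama ps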
Performance Tips
1. Use GPU When Available
GPU provides 10-100x speedup:
# Check GPU availability
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
2. Choose Right Model Size
| Model | Size | Speed (CPU) | Speed (GPU) | Use Case |
|---|---|---|---|---|
| llama3.2:1b | ~700MB | Slow | Fast | Development |
| llama3.2:3b | ~2GB | Medium | Very Fast | Recommended |
| llama3.2 | ~2GB | Medium | Very Fast | Production |
3. Cache Models
Models are cached in the ollama_data volume, so they persist between restarts.
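To see where that volume lives on disk (the volume name carries the Compose project prefix; haya-routing is assumed here, matching the backup example below):
# Inspect the named volume that holds the downloaded models
docker volume inspect haya-routing_ollama_data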
4. Monitor Resources
# Check container resources
docker stats ollama
# Check disk usage
docker system df
Integration with AI Routing
Once Ollama is running:
1. Update Environment
# .env
AI_ROUTING_ENABLED=true
AI_ROUTING_PROVIDER=ollama
OLLAMA_BASE_URL=http://localhost:11434
AI_ROUTING_MODEL=llama3.2
2. Restart Service
docker-compose restart haya-routing
3. Test AI Routing
curl -X POST http://localhost:3000/api/v1/route/ai \
-H "Content-Type: application/json" \
-d '{
"tenant_id": "tenant-001",
"app_id": "app-001",
"roles": ["developer"],
"query": "Show network device inventory"
}'
Maintenance
Update Ollama
docker-compose pull ollama
docker-compose up -d ollama
Update Model
docker-compose exec ollama ollama pull llama3.2
Clean Up
# Remove unused models
docker-compose exec ollama ollama list
docker-compose exec ollama ollama rm <model-name>
# Remove volume (deletes all models)
docker-compose down -v
Backup Models
# Models are stored in ollama_data volume
docker run --rm -v haya-routing_ollama_data:/data -v $(pwd):/backup \
alpine tar czf /backup/ollama-backup.tar.gz /data
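The matching restore is the reverse operation (a sketch, assuming the same volume name; stop Ollama first so files are not written during the restore):
# Restore a previously created backup into the ollama_data volume
docker-compose stop ollama
docker run --rm -v haya-routing_ollama_data:/data -v $(pwd):/backup \
  alpine tar xzf /backup/ollama-backup.tar.gz -C /
docker-compose up -d ollama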
Resource Requirements
Minimum (CPU Mode)
- CPU: 2+ cores
- RAM: 4GB
- Disk: 5GB (for model)
- Speed: ~5-10 seconds per request
Recommended (GPU Mode)
- GPU: NVIDIA with 4GB+ VRAM
- CPU: 2+ cores
- RAM: 8GB
- Disk: 5GB
- Speed: ~1-2 seconds per request
Summary
✅ Ollama is configured in docker-compose.yml
✅ Runs on port 11434
✅ Models persist in ollama_data volume
✅ GPU support included (optional)
✅ Setup script available: ./scripts/setup-ollama.sh
Next steps:
- Run docker-compose up -d ollama
- Run ./scripts/setup-ollama.sh to pull Llama 3.2
- Configure .env with AI_ROUTING_PROVIDER=ollama
- Restart the service and test!