Recently, I needed to configure a new machine with an NVIDIA GPU to serve an Ollama instance that could be queried from other devices on the same local network.
This article walks you through the process I used to get it working.
Install NVIDIA drivers and utility packages
Before proceeding, we need to make sure the NVIDIA card is recognised and install a few utility packages. I use Arch Linux so I will use the pacman package manager.
sudo pacman -Syu nvidia-open
NB: I use the open kernel module version of the NVIDIA driver rather than the proprietary one. I tried the DKMS variant before that, but it didn't work (my NVIDIA card wouldn't get recognised).
You can check the installed driver version using the following command:
cat /proc/driver/nvidia/version
You can make sure your card is correctly recognised using the following command:
# install the nvidia-utils package first if the nvidia-smi command is not available
nvidia-smi
# you can check the GPU number using the following command (you are going to need it later on)
nvidia-smi -L
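For reference, nvidia-smi -L prints one line per GPU with its index and UUID. The card name and UUID below are placeholders and will differ on your machine; the leading index (here 0) is the value you will pass to CUDA_VISIBLE_DEVICES later on:
GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)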
Finally, install the utility packages:
sudo pacman -Syu nvidia-container-toolkit nvidia-utils
Install and configure docker
Install docker
If not already installed, install the docker and docker-compose packages:
sudo pacman -Syu docker docker-compose
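NB: on a fresh install the docker daemon may not be running yet. If that is the case, enable and start the service before continuing:
# start docker now and enable it at boot
sudo systemctl enable --now docker.service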
Configure docker to use GPUs
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker.service
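To check that the NVIDIA runtime is now registered with docker, you can inspect the daemon information; the Runtimes line should list an nvidia entry (the exact output depends on your setup):
# look for "nvidia" in the Runtimes line
sudo docker info | grep -i runtime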
Launch Ollama as a docker image
Using a docker run command
Use the following command to launch Ollama as a docker container:
sudo docker run -d --gpus=all -e CUDA_VISIBLE_DEVICES=0 -e OLLAMA_FLASH_ATTENTION=1 -e OLLAMA_HOST=0.0.0.0 -e OLLAMA_ORIGINS="*" -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
Some explanation of the options:
- CUDA_VISIBLE_DEVICES: the index of the GPU on which to run. You can list multiple, comma-separated indices (CUDA_VISIBLE_DEVICES=0,1).
- OLLAMA_FLASH_ATTENTION: enables flash attention, an optional optimisation.
- OLLAMA_HOST: the network interfaces on which to listen for incoming requests (0.0.0.0 = all network interfaces).
- OLLAMA_ORIGINS: controls which origins are allowed for CORS requests (optional).
- -v ollama:/root/.ollama: creates a named volume and mounts it at /root/.ollama in the container, so downloaded models persist across container restarts.
- -p 11434:11434: maps port 11434 on the host to port 11434 in the container (our Ollama instance can then be reached on port 11434 from other devices on the network, as shown in the example requests after this list).
- --name ollama: the name the container will have in docker.
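Once the container is up, you can check that the API is reachable from another device on the local network. In the commands below, 192.168.1.50 is a placeholder for the IP address of the machine running Ollama, and the second request assumes the llama3.2 model has already been pulled (we do that further down):
# health check: should return "Ollama is running"
curl http://192.168.1.50:11434/

# request a completion from the API
curl http://192.168.1.50:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'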
Using a docker compose file
An alternative approach to launching the container is to use a docker compose file.
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    environment:
      - CUDA_VISIBLE_DEVICES=0
      - OLLAMA_FLASH_ATTENTION=1
      - OLLAMA_HOST=0.0.0.0
      - OLLAMA_ORIGINS=*
    volumes:
      - ollama:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
volumes:
  ollama:
    driver: local
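The deploy section here plays the same role as the --gpus=all flag from the docker run command: it reserves the NVIDIA GPUs for the container. Save the file as docker-compose.yml and start the container from the same directory (depending on your installation, the command is either docker compose or docker-compose):
# start the ollama service in the background
sudo docker compose up -d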
Run a model to try the setup
sudo docker exec -it ollama ollama run llama3.2
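If you only want to download a model without opening an interactive prompt, you can use ollama pull instead; mistral below is just an example of another model name from the Ollama library:
# download a model without starting an interactive session
sudo docker exec -it ollama ollama pull mistral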
Check that the model is loaded on the GPU and not the CPU
sudo docker exec -it ollama ollama ps
You should see that the model has been loaded on the GPU and not on the CPU (the PROCESSOR column should read 100% GPU).
NB: If the model you are using needs more memory than your available VRAM, Ollama will offload part or all of the model to the CPU instead of the GPU.
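You can also keep an eye on GPU memory usage from the host while you interact with the model; nvidia-smi reports memory usage and the processes using the GPU:
# refresh the GPU stats every second
watch -n 1 nvidia-smi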
