Recently, I needed to configure a new machine with an NVIDIA GPU to serve an Ollama instance that could be queried from other devices on the same local network.

This article walks you through the process I used to get it working.

Install NVIDIA drivers and utility packages

Before proceeding, we need to make sure the NVIDIA card is recognised and to install a few utility packages. I use Arch Linux, so the commands below use the pacman package manager.

Bash
sudo pacman -Syu nvidia-open

NB: I use nvidia-open, the open kernel module variant of the NVIDIA driver. I tried the DKMS variant before that, but it didn’t work (my NVIDIA card wouldn’t get recognised).

You can check the loaded driver version using the following command:

Bash
cat /proc/driver/nvidia/version

You can make sure your card is correctly recognised using the command:

Bash
# nvidia-smi is provided by the nvidia-utils package; install it if the command is missing
nvidia-smi

# list the GPUs and their index numbers (you will need the index later for CUDA_VISIBLE_DEVICES)
nvidia-smi -L
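
For reference, the output of nvidia-smi -L looks roughly like this (the GPU model and UUID below are placeholders; yours will differ):

GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)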

Finally, install the utility packages:

Bash
sudo pacman -Syu nvidia-container-toolkit nvidia-utils

Install and configure docker

Install docker

If not already installed, install the docker and docker-compose packages:

Bash
sudo pacman -Syu docker docker-compose
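
If the Docker daemon is not already running, enable and start it before going further:

Bash
sudo systemctl enable --now docker.service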

Configure docker to use GPUs

Bash
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker.service
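
To confirm that containers can actually access the GPU, you can run nvidia-smi inside a throwaway CUDA container (the image tag below is only an example; use whichever CUDA base image tag is currently available):

Bash
sudo docker run --rm --gpus=all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi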

Launch Ollama as a docker container

Using a docker run command

Use the following command to launch Ollama as a docker container:

Bash
sudo docker run -d --gpus=all -e CUDA_VISIBLE_DEVICES=0 -e OLLAMA_FLASH_ATTENTION=1 -e OLLAMA_HOST=0.0.0.0 -e OLLAMA_ORIGINS="*" -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

Some explanation of the options:

  • CUDA_VISIBLE_DEVICES: the index of the GPU to run on (as reported by nvidia-smi -L). You can list several indices, comma-separated (CUDA_VISIBLE_DEVICES=0,1).
  • OLLAMA_FLASH_ATTENTION: enables flash attention, an optional optimisation.
  • OLLAMA_HOST: the network interfaces on which to listen for incoming requests (0.0.0.0 = all network interfaces). This is what makes the instance reachable from other devices on the local network; see the example after this list.
  • OLLAMA_ORIGINS: configures the origins allowed for CORS requests (optional).
  • -v ollama:/root/.ollama: creates a named Docker volume and mounts it at /root/.ollama in the container, so downloaded models persist across container restarts.
  • -p 11434:11434: publishes container port 11434 on host port 11434 (our Ollama instance can then be reached on port 11434).
  • --name ollama: the name the container will have in docker.
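
With the container running, any device on the local network can query the Ollama HTTP API on port 11434. Here is a quick test with curl (replace <server-ip> with the LAN IP of the machine running Ollama; the model must already have been pulled, which the ollama run step below takes care of):

Bash
curl http://<server-ip>:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'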

Using a docker compose file

An alternative approach to launching the container is to use a docker compose file.

YAML
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    environment:
      - CUDA_VISIBLE_DEVICES=0
      - OLLAMA_FLASH_ATTENTION=1
      - OLLAMA_HOST=0.0.0.0
      - OLLAMA_ORIGINS=*
    volumes:
      - ollama:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  ollama:
    driver: local
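
Save this as docker-compose.yml and start the stack with:

Bash
sudo docker compose up -d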

Run a model to try the setup

Bash
sudo docker exec -it ollama ollama run llama3.2
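
This opens an interactive chat session (and pulls the model on first use). For a quick one-shot test, you can also pass the prompt directly on the command line:

Bash
sudo docker exec -it ollama ollama run llama3.2 "Say hello in one sentence."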

Check that the model is loaded on the GPU and not the CPU

Bash
sudo docker exec -it ollama ollama ps

You should see that the model has been loaded on the GPU and not on the CPU.

NB: If the model you are using needs more memory than your available VRAM, Ollama will fall back to running it partly or entirely on the CPU.
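
You can also double-check from the host: while the model is loaded, nvidia-smi should report memory allocated on the GPU.

Bash
nvidia-smi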
