Recently, I needed to configure a new machine with an NVIDIA GPU to serve an Ollama instance that could be queried from other devices on the same local network.

This article walks you through the process I used to get it working.

Install NVIDIA drivers and utility packages

Before proceeding, we need to make sure the NVIDIA card is recognised and to install a few utility packages. I use Arch Linux, so the commands below use the pacman package manager.

Bash
sudo pacman -Syu nvidia-open

NB: I use nvidia-open, the open kernel module variant of the NVIDIA driver. I tried the DKMS variant before that, but it didn’t work (my NVIDIA card wouldn’t get recognised).

You can check the loaded driver version using the following command:

Bash
cat /proc/driver/nvidia/version

You can make sure your card is correctly recognised using the command:

Bash
# nvidia-smi is provided by the nvidia-utils package; install it if the command is missing
nvidia-smi

# list the GPUs and their index numbers (you will need the index later for CUDA_VISIBLE_DEVICES)
nvidia-smi -L
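
For reference, the output of nvidia-smi -L looks roughly like this (the GPU model and UUID below are placeholders; yours will differ):

GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)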

Finally, install the utility packages:

Bash
sudo pacman -Syu nvidia-container-toolkit nvidia-utils

Install and configure docker

Install docker

If not already installed, install the docker and docker-compose packages:

Bash
sudo pacman -Syu docker docker-compose
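
If the Docker daemon is not already running, enable and start it before going further:

Bash
sudo systemctl enable --now docker.service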

Configure docker to use GPUs

Bash
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker.service
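
To confirm that containers can actually access the GPU, you can run nvidia-smi inside a throwaway CUDA container (the image tag below is only an example; use whichever CUDA base image tag is currently available):

Bash
sudo docker run --rm --gpus=all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi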

Launch Ollama as a docker container

Using a docker run command

Use the following command to launch Ollama as a docker container:

Bash
sudo docker run -d --gpus=all -e CUDA_VISIBLE_DEVICES=0 -e OLLAMA_FLASH_ATTENTION=1 -e OLLAMA_HOST=0.0.0.0 -e OLLAMA_ORIGINS="*" -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

Some explanation of the options:

  • CUDA_VISIBLE_DEVICES: the index of the GPU to run on (as reported by nvidia-smi -L). You can list several indices, comma-separated (CUDA_VISIBLE_DEVICES=0,1).
  • OLLAMA_FLASH_ATTENTION: enables flash attention, an optional optimisation.
  • OLLAMA_HOST: the network interfaces on which to listen for incoming requests (0.0.0.0 = all network interfaces). This is what makes the instance reachable from other devices on the local network; see the example after this list.
  • OLLAMA_ORIGINS: configures the origins allowed for CORS requests (optional).
  • -v ollama:/root/.ollama: creates a named Docker volume and mounts it at /root/.ollama in the container, so downloaded models persist across container restarts.
  • -p 11434:11434: publishes container port 11434 on host port 11434 (our Ollama instance can then be reached on port 11434).
  • --name ollama: the name the container will have in docker.
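
With the container running, any device on the local network can query the Ollama HTTP API on port 11434. Here is a quick test with curl (replace <server-ip> with the LAN IP of the machine running Ollama; the model must already have been pulled, which the ollama run step below takes care of):

Bash
curl http://<server-ip>:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'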

Using a docker compose file

An alternative approach to launching the container is to use a docker compose file.

YAML
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    environment:
      - CUDA_VISIBLE_DEVICES=0
      - OLLAMA_FLASH_ATTENTION=1
      - OLLAMA_HOST=0.0.0.0
      - OLLAMA_ORIGINS=*
    volumes:
      - ollama:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  ollama:
    driver: local
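
Save this as docker-compose.yml and start the stack with:

Bash
sudo docker compose up -d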

Run a model to try the setup

Bash
sudo docker exec -it ollama ollama run llama3.2
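
This opens an interactive chat session (and pulls the model on first use). For a quick one-shot test, you can also pass the prompt directly on the command line:

Bash
sudo docker exec -it ollama ollama run llama3.2 "Say hello in one sentence."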

Check that the model is loaded on the GPU and not the CPU

Bash
sudo docker exec -it ollama ollama ps

You should see that the model has been loaded on the GPU and not on the CPU.

NB: If the model you are using needs more memory than your available VRAM, Ollama will fall back to running it partly or entirely on the CPU.
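
You can also double-check from the host: while the model is loaded, nvidia-smi should report memory allocated on the GPU.

Bash
nvidia-smi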
