I spent some time poking around with LLMs and a setup called RAG (Retrieval-Augmented Generation), and I stumbled on some pretty interesting benchmarks on macOS.

I’ve been a huge fan of MacBooks for the last 10+ years, especially since Apple launched its Apple Silicon chips. When I played around with LLMs, I ran my setup on my M1 Pro MacBook with 32 GB of RAM.

Here is what I saw.

Note: I won’t go into the full setup here since that is a topic of its own; I’ll probably write another, more complete post about it later. But to make the test results understandable, I will briefly describe the setup I have been working with.

The test setup

My test setup included the following:

  • A Docker Compose file (a minimal sketch follows this list) with:
    • one container running the vector database (ChromaDB in my case).
    • one container running Ollama. Inside that container, I pulled two models:
      • "nomic-embed-text", used to generate the embeddings and power the retrieval.
      • "llama3.2", used as the main model to query.
  • A Python script to feed the database with knowledge and generate the embeddings (also sketched below).
  • Another Python script to query the model, using the retrieved embeddings together with the user request (a timed version is sketched in the test case section).
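
For illustration, here is a stripped-down sketch of such a Compose file. The service names, ports, and volume paths are the images’ usual defaults, not necessarily my exact file:

```yaml
# docker-compose.yml -- minimal sketch, not my exact configuration
services:
  chromadb:
    image: chromadb/chroma            # vector database, listens on 8000 by default
    ports:
      - "8000:8000"
    volumes:
      - chroma-data:/chroma/chroma    # persist the collections between restarts

  ollama:
    image: ollama/ollama              # LLM runtime, listens on 11434 by default
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama     # persist the pulled models (nomic-embed-text, llama3.2)

volumes:
  chroma-data:
  ollama-data:
```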

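To give an idea of what the ingestion script does, here is a stripped-down sketch using the chromadb and ollama Python packages; the collection name and documents are placeholders, not my actual data:

```python
# ingest.py -- minimal sketch of the "feed the database" script (placeholder data)
import chromadb
import ollama  # talks to Ollama on http://localhost:11434 by default

client = chromadb.HttpClient(host="localhost", port=8000)
collection = client.get_or_create_collection("knowledge")

documents = [
    "First chunk of the knowledge base...",
    "Second chunk of the knowledge base...",
]

for i, doc in enumerate(documents):
    # nomic-embed-text turns each chunk into a vector
    embedding = ollama.embeddings(model="nomic-embed-text", prompt=doc)["embedding"]
    collection.add(ids=[str(i)], documents=[doc], embeddings=[embedding])
```
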
The test case

Once the database was populated, I started querying the model with some simple questions whose answers could be found inside the knowledge base.

I printed how long it took to get answers from my local model, with and without RAG.
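
The timed query logic looks roughly like this (again a simplified sketch; the prompt template and collection name are placeholders):

```python
# query.py -- minimal sketch of the timed query script (placeholder prompt/collection)
import time
import chromadb
import ollama

client = chromadb.HttpClient(host="localhost", port=8000)
collection = client.get_or_create_collection("knowledge")

question = "A question whose answer lives in the knowledge base"

# 1. Embed the question and fetch the closest chunks ("Get embeddings" below)
t0 = time.time()
q_emb = ollama.embeddings(model="nomic-embed-text", prompt=question)["embedding"]
hits = collection.query(query_embeddings=[q_emb], n_results=3)
context = "\n".join(hits["documents"][0])
print(f"Get embeddings: {time.time() - t0:.2f}s")

# 2. Ask llama3.2 without RAG (the question alone)
t0 = time.time()
ollama.generate(model="llama3.2", prompt=question)
print(f"Answer without RAG: {time.time() - t0:.2f}s")

# 3. Ask llama3.2 with RAG (retrieved chunks prepended to the question)
t0 = time.time()
ollama.generate(
    model="llama3.2",
    prompt=f"Answer using this context:\n{context}\n\nQuestion: {question}",
)
print(f"Answer with RAG: {time.time() - t0:.2f}s")
```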

The test results with the initial setup

  • Import docs
    • Importing my docs into the vector database took between 918 seconds (about 15.3 minutes) and 1002 seconds (about 16.7 minutes). That seemed quite long, especially since the knowledge base is not that big.
  • Search:
    • Question 1
      • Get embeddings: took 0.3 seconds
      • Get an answer from the LLM without RAG: took 22 seconds
      • Get an answer from the LLM with RAG: took 85 seconds
    • Question 2
      • Get embeddings: took 0.7 seconds
      • Get an answer from the LLM without RAG: took 50 seconds
      • Get an answer from the LLM with RAG: took 125 seconds
    • Question 3
      • Get embeddings: took 0.2 seconds
      • Get an answer from the LLM without RAG: took 31 seconds
      • Get an answer from the LLM with RAG: took 112 seconds
    • Question 4
      • Get embeddings: took 0.2 seconds
      • Get an answer from the LLM without RAG: took 3 seconds (but returned no answer)
      • Get an answer from the LLM with RAG: took 85 seconds

Preliminary conclusion

Something seemed odd. Even though my MacBook is not the latest generation (which is specifically tailored for AI workloads), I expected better performance given how simple the setup was, both for importing the docs and for querying.

I started looking at the different parts and replaced the llama3.2 model with smaller variants, but performance did not improve much.

Next, I looked into enabling specific GPU options in my Docker Compose file, but it turns out only Nvidia GPUs have a dedicated configuration there. Additionally, I came across an article demonstrating that, for now, Apple Silicon GPUs are not supported in Docker, meaning that with my initial configuration the computation would only use CPU power. So I decided to try a different configuration (the article is in the references should you be interested).
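
For reference, the GPU configuration Docker Compose does document is an Nvidia device reservation along these lines (shown only as an illustration; it has no Apple Silicon equivalent today):

```yaml
# Nvidia-only GPU reservation in Docker Compose -- not applicable to Apple Silicon
services:
  ollama:
    image: ollama/ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```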

New configuration

I slightly modified my configuration by removing Ollama from the Docker Compose file and running it natively on my MacBook (Ollama has a dedicated download for each operating system).
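
In practice, the Ollama side of the change boils down to installing the macOS build and pulling the same two models from the terminal (standard Ollama CLI commands; my exact steps may have differed slightly):

```bash
# after installing Ollama from the macOS download (or via Homebrew)
ollama pull nomic-embed-text
ollama pull llama3.2
```

The native Ollama listens on its default port, so pointing the scripts at http://localhost:11434 is all that should be needed. Here are the test results for the same test cases as above.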

The test results with the modified setup

  • Import docs
    • Importing my docs into the vector database took between 52 and 57 seconds.
  • Search:
    • Question 1
      • Get embeddings: took between 0.086 and 0.29 seconds
      • Get an answer from the LLM without RAG: took between 7.3 and 15 seconds
      • Get an answer from the LLM with RAG: took between 10 and 15 seconds
    • Question 2
      • Get embeddings: took 0.05 seconds
      • Get an answer from the LLM without RAG: took between 6.4 and 16 seconds
      • Get an answer from the LLM with RAG: took between 15 and 26 seconds
    • Question 3
      • Get embeddings: took 0.05 seconds
      • Get an answer from the LLM without RAG: took 8 seconds
      • Get an answer from the LLM with RAG: took 8 seconds
    • Question 4
      • Get embeddings: took 0.05 seconds
      • Get an answer from the LLM without RAG: took 25 seconds
      • Get an answer from the LLM with RAG: took between 9 and 13 seconds

Conclusion

Obviously, the results with the modified setup are way better, and an application that responds within about 10 seconds seems reasonable for a local setup.

Keep in mind that I did not talk at all about the quality of the responses, only the response time. Also, I did not stress test the setup; having multiple users querying the model at the same time may impact response time.

Strangely, it seems that using Docker does not let me use the GPU of my MacBook (or maybe I missed a configuration?).

Anyway, if you set up a local stack using Ollama on a MacBook, you should run your models natively (at least for now, until GPU support for Apple Silicon comes to Docker).

References

  • https://chariotsolutions.com/blog/post/apple-silicon-gpus-docker-and-ollama-pick-two/
  • https://ollama.com/blog/ollama-is-now-available-as-an-official-docker-image