Why I Built This Instead of Using Ollama

I wanted to run AI models locally on my Arch Linux laptop — privately, offline, with zero cloud dependency. Ollama seemed like the obvious choice, but I didn’t want an opaque, heavy framework. I wanted control.

After navigating outdated documentation and breaking changes, I got llama.cpp running with Vulkan GPU acceleration on my Intel Iris Xe. This is the guide I wish I had.

Who this is for: Any Linux user with 8–16 GB RAM who wants a private, lightweight, local LLM — without Docker, without heavy frameworks, without cloud APIs.

System Specs

OS:  Arch Linux x86_64
CPU: Intel i5-1137G7 (4C/8T)
GPU: Intel Iris Xe Graphics
RAM: 16 GB
WM:  Hyprland (Wayland)

Step 1: Install Dependencies

Everything you need is in the official Arch repos:

sudo pacman -S git cmake ninja vulkan-intel vulkan-devel shaderc

⚠️ Critical: Don’t skip vulkan-devel. The vulkan-intel package only provides runtime support. Without vulkan-devel, the GPU build will silently fall back to CPU-only mode.

Step 2: Clone llama.cpp

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

Note: Clone the repo directly — don’t create the folder manually first.

Step 3: Build with Vulkan GPU Acceleration

cmake -B build -G Ninja -DGGML_VULKAN=ON
cmake --build build

Verify the binary was created:

ls build/bin | grep llama-cli

Important: The old ./main binary is gone. Use llama-cli instead.

Step 4: Download a Model

Create a models directory and download a quantized model:

mkdir -p models
cd models
wget https://huggingface.co/Qwen/Qwen2.5-3B-Instruct-GGUF/resolve/main/qwen2.5-3b-instruct-q4_k_m.gguf
cd ..

Troubleshooting: If the file saves as .gguf.1, rename it:

mv models/*.gguf.1 models/qwen2.5-3b-instruct-q4_k_m.gguf

Step 5: Run the Model

./build/bin/llama-cli \
  -m models/qwen2.5-3b-instruct-q4_k_m.gguf \
  -t 8 \
  -ngl 20 \
  -c 2048

Flag breakdown:

  • -t 8 — Use all 8 CPU threads
  • -ngl 20 — Offload 20 layers to GPU
  • -c 2048 — Context window size

If Vulkan is working correctly, you’ll see:

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Iris(R) Xe Graphics

Common Errors & Solutions

Error: fish: Unknown command: ./main

Solution: The binary name changed. Use:

./build/bin/llama-cli

Error: Makefile:6: Build system changed

Solution: Rebuild with CMake:

cmake -B build -G Ninja -DGGML_VULKAN=ON && cmake --build build

Error: Could NOT find Vulkan

Solution: Install the development headers:

sudo pacman -S vulkan-devel

Error: LLAMA_VULKAN not used

Solution: The flag changed. Use -DGGML_VULKAN=ON instead of -DLLAMA_VULKAN=ON.

Error: invalid argument: -i

Solution: Interactive mode is now the default. Remove the -i flag.

Model Selection Guide

Model Size RAM Usage Speed Best For
Qwen 2.5 3B Q4_K_M ~2 GB 2–3 GB ⚡ Fast Beginners, quick responses
Mistral 7B Q4_K_M ~4 GB 5–6 GB ▶ Medium Balanced quality/speed
Llama 3 8B Q4_K_M ~5 GB 6–8 GB 🐢 Slower Best quality output

Recommendation: Start with Qwen 2.5 3B. It’s fast, lightweight, and surprisingly capable for most tasks.

Bonus: Run as a Local API Server

Want to access your LLM through a web interface or API?

./build/bin/llama-server \
  -m models/qwen2.5-3b-instruct-q4_k_m.gguf \
  -ngl 20

Then open your browser to:

http://localhost:8080

You now have a local ChatGPT-style interface with no internet connection required.

What I Learned

The biggest challenges weren’t technical — they were navigating breaking changes and outdated documentation:

  1. Build system changed: make → CMake with Ninja
  2. Flag renamed: LLAMA_VULKANGGML_VULKAN
  3. Missing dependency: Need vulkan-devel, not just vulkan-intel
  4. Interactive flag removed: -i is now default behavior
  5. Binary renamed: mainllama-cli

Once these issues were resolved, everything ran smoothly — even on Intel Iris Xe integrated graphics.

Performance Notes

On my Intel i5-1137G7 with Iris Xe:

  • Qwen 2.5 3B: ~15-20 tokens/second
  • Mistral 7B: ~8-12 tokens/second
  • GPU offloading: 2-3x faster than CPU-only

Your mileage will vary based on model size, quantization level, and hardware.

What’s Next?

Next step: integrate Open WebUI to turn this into a full local ChatGPT alternative with conversation history, model switching, and a polished interface.

Resources


Running AI locally isn’t just about privacy — it’s about understanding the tools you use. No black boxes, no vendor lock-in, just you and the model.

Happy hacking 🐧