Why I Built This Instead of Using Ollama
I wanted to run AI models locally on my Arch Linux laptop — privately, offline, with zero cloud dependency. Ollama seemed like the obvious choice, but I didn’t want an opaque, heavy framework. I wanted control.
After navigating outdated documentation and breaking changes, I got llama.cpp running with Vulkan GPU acceleration on my Intel Iris Xe. This is the guide I wish I had.
Who this is for: Any Linux user with 8–16 GB RAM who wants a private, lightweight, local LLM — without Docker, without heavy frameworks, without cloud APIs.
System Specs
OS: Arch Linux x86_64
CPU: Intel i5-1137G7 (4C/8T)
GPU: Intel Iris Xe Graphics
RAM: 16 GB
WM: Hyprland (Wayland)
Step 1: Install Dependencies
Everything you need is in the official Arch repos:
sudo pacman -S git cmake ninja vulkan-intel vulkan-devel shaderc
⚠️ Critical: Don’t skip vulkan-devel. The vulkan-intel package only provides runtime support. Without vulkan-devel, the GPU build will silently fall back to CPU-only mode.
Step 2: Clone llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
Note: Clone the repo directly — don’t create the folder manually first.
Step 3: Build with Vulkan GPU Acceleration
cmake -B build -G Ninja -DGGML_VULKAN=ON
cmake --build build
Verify the binary was created:
ls build/bin | grep llama-cli
Important: The old ./main binary is gone. Use llama-cli instead.
Step 4: Download a Model
Create a models directory and download a quantized model:
mkdir -p models
cd models
wget https://huggingface.co/Qwen/Qwen2.5-3B-Instruct-GGUF/resolve/main/qwen2.5-3b-instruct-q4_k_m.gguf
cd ..
Troubleshooting: If the file saves as .gguf.1, rename it:
mv models/*.gguf.1 models/qwen2.5-3b-instruct-q4_k_m.gguf
Step 5: Run the Model
./build/bin/llama-cli \
-m models/qwen2.5-3b-instruct-q4_k_m.gguf \
-t 8 \
-ngl 20 \
-c 2048
Flag breakdown:
-t 8— Use all 8 CPU threads-ngl 20— Offload 20 layers to GPU-c 2048— Context window size
If Vulkan is working correctly, you’ll see:
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Iris(R) Xe Graphics
Common Errors & Solutions
Error: fish: Unknown command: ./main
Solution: The binary name changed. Use:
./build/bin/llama-cli
Error: Makefile:6: Build system changed
Solution: Rebuild with CMake:
cmake -B build -G Ninja -DGGML_VULKAN=ON && cmake --build build
Error: Could NOT find Vulkan
Solution: Install the development headers:
sudo pacman -S vulkan-devel
Error: LLAMA_VULKAN not used
Solution: The flag changed. Use -DGGML_VULKAN=ON instead of -DLLAMA_VULKAN=ON.
Error: invalid argument: -i
Solution: Interactive mode is now the default. Remove the -i flag.
Model Selection Guide
| Model | Size | RAM Usage | Speed | Best For |
|---|---|---|---|---|
| Qwen 2.5 3B Q4_K_M | ~2 GB | 2–3 GB | ⚡ Fast | Beginners, quick responses |
| Mistral 7B Q4_K_M | ~4 GB | 5–6 GB | ▶ Medium | Balanced quality/speed |
| Llama 3 8B Q4_K_M | ~5 GB | 6–8 GB | 🐢 Slower | Best quality output |
Recommendation: Start with Qwen 2.5 3B. It’s fast, lightweight, and surprisingly capable for most tasks.
Bonus: Run as a Local API Server
Want to access your LLM through a web interface or API?
./build/bin/llama-server \
-m models/qwen2.5-3b-instruct-q4_k_m.gguf \
-ngl 20
Then open your browser to:
http://localhost:8080
You now have a local ChatGPT-style interface with no internet connection required.
What I Learned
The biggest challenges weren’t technical — they were navigating breaking changes and outdated documentation:
- Build system changed:
make→ CMake with Ninja - Flag renamed:
LLAMA_VULKAN→GGML_VULKAN - Missing dependency: Need
vulkan-devel, not justvulkan-intel - Interactive flag removed:
-iis now default behavior - Binary renamed:
main→llama-cli
Once these issues were resolved, everything ran smoothly — even on Intel Iris Xe integrated graphics.
Performance Notes
On my Intel i5-1137G7 with Iris Xe:
- Qwen 2.5 3B: ~15-20 tokens/second
- Mistral 7B: ~8-12 tokens/second
- GPU offloading: 2-3x faster than CPU-only
Your mileage will vary based on model size, quantization level, and hardware.
What’s Next?
Next step: integrate Open WebUI to turn this into a full local ChatGPT alternative with conversation history, model switching, and a polished interface.
Resources
Running AI locally isn’t just about privacy — it’s about understanding the tools you use. No black boxes, no vendor lock-in, just you and the model.
Happy hacking 🐧