Running LLMs Locally on Arch Linux — No Cloud, No Ollama

Why I Built This Instead of Using Ollama

I wanted to run AI models locally on my Arch Linux laptop — privately, offline, with zero cloud dependency. Ollama seemed like the obvious choice, but I didn’t want an opaque, heavy framework. I wanted control.

After navigating outdated documentation and breaking changes, I got llama.cpp running with Vulkan GPU acceleration on my Intel Iris Xe. This is the guide I wish I had.

Who this is for: Any Linux user with 8–16 GB RAM who wants a private, lightweight, local LLM — without Docker, without heavy frameworks, without cloud APIs.

System Specs

OS:  Arch Linux x86_64
CPU: Intel i5-1137G7 (4C/8T)
GPU: Intel Iris Xe Graphics
RAM: 16 GB
WM:  Hyprland (Wayland)

Step 1: Install Dependencies

Everything you need is in the official Arch repos:

sudo pacman -S git cmake ninja vulkan-intel vulkan-devel shaderc

⚠️ Critical: Don’t skip vulkan-devel. The vulkan-intel package only provides runtime support. Without vulkan-devel, the GPU build will silently fall back to CPU-only mode.

Step 2: Clone llama.cpp

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

Note: Clone the repo directly — don’t create the folder manually first.

Step 3: Build with Vulkan GPU Acceleration

cmake -B build -G Ninja -DGGML_VULKAN=ON
cmake --build build

Verify the binary was created:

ls build/bin | grep llama-cli

Important: The old ./main binary is gone. Use llama-cli instead.

Step 4: Download a Model

Create a models directory and download a quantized model:

mkdir -p models
cd models
wget https://huggingface.co/Qwen/Qwen2.5-3B-Instruct-GGUF/resolve/main/qwen2.5-3b-instruct-q4_k_m.gguf
cd ..

Troubleshooting: If the file saves as .gguf.1, rename it:

mv models/*.gguf.1 models/qwen2.5-3b-instruct-q4_k_m.gguf

Step 5: Run the Model

./build/bin/llama-cli \
  -m models/qwen2.5-3b-instruct-q4_k_m.gguf \
  -t 8 \
  -ngl 20 \
  -c 2048

Flag breakdown:

-t 8 — Use all 8 CPU threads
-ngl 20 — Offload 20 layers to GPU
-c 2048 — Context window size

If Vulkan is working correctly, you’ll see:

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Iris(R) Xe Graphics

Common Errors & Solutions

Error: `fish: Unknown command: ./main`

Solution: The binary name changed. Use:

./build/bin/llama-cli

Error: `Makefile:6: Build system changed`

Solution: Rebuild with CMake:

cmake -B build -G Ninja -DGGML_VULKAN=ON && cmake --build build

Error: `Could NOT find Vulkan`

Solution: Install the development headers:

sudo pacman -S vulkan-devel

Error: `LLAMA_VULKAN not used`

Solution: The flag changed. Use -DGGML_VULKAN=ON instead of -DLLAMA_VULKAN=ON.

Error: `invalid argument: -i`

Solution: Interactive mode is now the default. Remove the -i flag.

Model Selection Guide

Model	Size	RAM Usage	Speed	Best For
Qwen 2.5 3B Q4_K_M	~2 GB	2–3 GB	⚡ Fast	Beginners, quick responses
Mistral 7B Q4_K_M	~4 GB	5–6 GB	▶ Medium	Balanced quality/speed
Llama 3 8B Q4_K_M	~5 GB	6–8 GB	🐢 Slower	Best quality output

Recommendation: Start with Qwen 2.5 3B. It’s fast, lightweight, and surprisingly capable for most tasks.

Bonus: Run as a Local API Server

Want to access your LLM through a web interface or API?

./build/bin/llama-server \
  -m models/qwen2.5-3b-instruct-q4_k_m.gguf \
  -ngl 20

Then open your browser to:

http://localhost:8080

You now have a local ChatGPT-style interface with no internet connection required.

What I Learned

The biggest challenges weren’t technical — they were navigating breaking changes and outdated documentation:

Build system changed: make → CMake with Ninja
Flag renamed: LLAMA_VULKAN → GGML_VULKAN
Missing dependency: Need vulkan-devel, not just vulkan-intel
Interactive flag removed: -i is now default behavior
Binary renamed: main → llama-cli

Once these issues were resolved, everything ran smoothly — even on Intel Iris Xe integrated graphics.

Performance Notes

On my Intel i5-1137G7 with Iris Xe:

Qwen 2.5 3B: ~15-20 tokens/second
Mistral 7B: ~8-12 tokens/second
GPU offloading: 2-3x faster than CPU-only

Your mileage will vary based on model size, quantization level, and hardware.

What’s Next?

Next step: integrate Open WebUI to turn this into a full local ChatGPT alternative with conversation history, model switching, and a polished interface.

Resources

Running AI locally isn’t just about privacy — it’s about understanding the tools you use. No black boxes, no vendor lock-in, just you and the model.

Happy hacking 🐧

Why I Built This Instead of Using Ollama#

System Specs#

Step 1: Install Dependencies#

Step 2: Clone llama.cpp#

Step 3: Build with Vulkan GPU Acceleration#

Step 4: Download a Model#

Step 5: Run the Model#

Common Errors & Solutions#

Error: fish: Unknown command: ./main#

Error: Makefile:6: Build system changed#

Error: Could NOT find Vulkan#

Error: LLAMA_VULKAN not used#

Error: invalid argument: -i#

Model Selection Guide#

Bonus: Run as a Local API Server#

What I Learned#

Performance Notes#

What’s Next?#

Resources#