Why Build Another Inference Engine?
When I started working with large language models locally, I quickly ran into the usual suspects: slow inference times, memory bloat, and dependency hell. Most existing solutions are either too heavyweight (PyTorch with CUDA) or too opinionated about model formats.
So I built linfer, a Rust-based local LLM inference engine. In my CPU benchmarks below it generates tokens roughly 3x faster than llama.cpp and about 4.5x faster than PyTorch.
The Core Problem
Running LLMs locally on CPU is painful:
- Slow matrix operations: many stacks don't make full use of modern CPU vector instructions such as AVX2
- Memory inefficiency: models are often loaded in unquantized, memory-hungry formats
- Poor portability: CUDA dependencies make deployment a nightmare
The linfer Approach
1. Custom Bundle Format (.lnf)
Instead of using standard model formats, linfer uses a custom .lnf bundle that:
- Pre-quantizes weights to optimal bit-widths
- Stores tensors in cache-friendly layouts
- Includes metadata for zero-copy loading
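use std::fs::File;
use std::path::Path;

use memmap2::Mmap; // assumes the memmap2 crate

pub struct LnfBundle {
    header: BundleHeader,
    weights: Vec<QuantizedTensor>,
    config: ModelConfig,
}

impl LnfBundle {
    // `Result` here stands for the crate's own error alias.
    pub fn load_mmap(path: &Path) -> Result<Self> {
        let file = File::open(path)?;
        // SAFETY: the file must not be modified or truncated while it is mapped.
        let mmap = unsafe { Mmap::map(&file)? };
        // Note: `mmap` is dropped when this function returns, so for the load
        // to stay zero-copy the bundle has to take ownership of the mapping
        // instead of borrowing from it.
        Self::from_bytes(&mmap)
    }
}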
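The post doesn't spell out which quantization scheme the .lnf format uses, so the following is only an illustration of what "pre-quantizing" weights can look like: symmetric 8-bit quantization where a block of 32 weights shares one scale. `QuantBlock`, `BLOCK`, `quantize_block` and `dequantize_block` are hypothetical names, not part of linfer.

```rust
/// Hypothetical block size: 32 weights share a single f32 scale.
pub const BLOCK: usize = 32;

/// One quantized block: 32 bytes of weights plus 4 bytes of scale,
/// versus 128 bytes for the same weights stored as f32.
pub struct QuantBlock {
    pub scale: f32,
    pub values: [i8; BLOCK],
}

/// Symmetric quantization: map [-max_abs, max_abs] onto [-127, 127].
pub fn quantize_block(weights: &[f32; BLOCK]) -> QuantBlock {
    let max_abs = weights.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    // Avoid dividing by zero for an all-zero block.
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let mut values = [0i8; BLOCK];
    for (q, w) in values.iter_mut().zip(weights.iter()) {
        *q = (w / scale).round().clamp(-127.0, 127.0) as i8;
    }
    QuantBlock { scale, values }
}

/// Recover approximate f32 weights; the error per weight is at most scale / 2.
pub fn dequantize_block(block: &QuantBlock) -> [f32; BLOCK] {
    let mut out = [0.0f32; BLOCK];
    for (o, q) in out.iter_mut().zip(block.values.iter()) {
        *o = *q as f32 * block.scale;
    }
    out
}
```

Doing this once at bundle-creation time means the loader never has to touch full-precision weights, which is where the memory savings of a pre-quantized format would come from.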
2. AVX2 SIMD Kernels
The real performance gains come from hand-written AVX2 kernels for critical operations:
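// `_mm256_fmadd_ps` is an FMA instruction, so `fma` has to be enabled
// alongside `avx2`.
#[target_feature(enable = "avx2,fma")]
unsafe fn matmul_avx2(
    a: &[f32],
    b: &[f32],
    c: &mut [f32],
    m: usize,
    n: usize,
    k: usize,
) {
    use std::arch::x86_64::*;
    // No tail handling: each step loads and stores 8 floats, so n must be
    // a multiple of 8 and the slices must cover the full matrices.
    assert!(n % 8 == 0);
    assert!(a.len() >= m * k && b.len() >= k * n && c.len() >= m * n);
    for i in 0..m {
        for j in (0..n).step_by(8) {
            // Accumulator for c[i][j..j + 8]
            let mut acc = _mm256_setzero_ps();
            for p in 0..k {
                // Broadcast a[i][p] and multiply it with b[p][j..j + 8]
                let a_val = _mm256_set1_ps(a[i * k + p]);
                let b_vec = _mm256_loadu_ps(b.as_ptr().add(p * n + j));
                acc = _mm256_fmadd_ps(a_val, b_vec, acc);
            }
            _mm256_storeu_ps(c.as_mut_ptr().add(i * n + j), acc);
        }
    }
}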
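Each 256-bit instruction works on eight f32 lanes at once, and the fused multiply-add does the multiply and the add in a single step, where scalar code would handle one value at a time.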
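A kernel compiled with `#[target_feature]` is only sound to call on a CPU that actually supports those features. A minimal sketch of how that check could be wrapped, assuming a scalar fallback and a safe entry point named `matmul` (both hypothetical, not necessarily how linfer dispatches), uses the standard `is_x86_feature_detected!` macro:

```rust
/// Scalar reference: c = a * b with a (m x k), b (k x n), c (m x n), row-major.
pub fn matmul_scalar(a: &[f32], b: &[f32], c: &mut [f32], m: usize, n: usize, k: usize) {
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = acc;
        }
    }
}

/// Vector kernel; the caller must ensure AVX2 and FMA are available
/// and that n is a multiple of 8.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn matmul_avx2_fma(a: &[f32], b: &[f32], c: &mut [f32], m: usize, n: usize, k: usize) {
    use std::arch::x86_64::*;
    for i in 0..m {
        for j in (0..n).step_by(8) {
            let mut acc = _mm256_setzero_ps();
            for p in 0..k {
                let a_val = _mm256_set1_ps(a[i * k + p]);
                let b_vec = _mm256_loadu_ps(b.as_ptr().add(p * n + j));
                acc = _mm256_fmadd_ps(a_val, b_vec, acc);
            }
            _mm256_storeu_ps(c.as_mut_ptr().add(i * n + j), acc);
        }
    }
}

/// Hypothetical safe entry point: validates sizes, then picks a kernel at runtime.
pub fn matmul(a: &[f32], b: &[f32], c: &mut [f32], m: usize, n: usize, k: usize) {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    assert_eq!(c.len(), m * n);

    #[cfg(target_arch = "x86_64")]
    {
        if n % 8 == 0 && is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            // SAFETY: the CPU features were just checked, the slice lengths
            // match the matrix shapes, and n is a multiple of 8.
            unsafe { matmul_avx2_fma(a, b, c, m, n, k) };
            return;
        }
    }
    matmul_scalar(a, b, c, m, n, k);
}
```

The fallback keeps the engine usable on CPUs without AVX2 and on shapes the vector kernel can't handle, at the cost of one cheap feature check per call.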
3. Memory-Efficient Attention
A naive attention implementation materializes the scores for the entire context in large temporary buffers. linfer instead streams over the key/value cache in blocks:
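// Attention for a single query row, streamed over the cache.
pub fn streaming_attention(
    query: &Tensor,
    key_cache: &mut KeyValueCache,
    block_size: usize,
) -> Tensor {
    let mut output = Tensor::zeros(query.shape());
    let mut running_max = f32::NEG_INFINITY; // largest score seen so far
    let mut running_sum = 0.0f32; // softmax denominator so far
    // Process in blocks to stay in L2 cache
    for block in key_cache.blocks(block_size) {
        let scores = query.matmul(&block.key);
        // A plain per-block softmax would normalize each block on its own,
        // so rescale against the running max instead (online softmax).
        // `max`, `sub_scalar`, `exp`, `sum` and `scale` are assumed Tensor helpers.
        let new_max = running_max.max(scores.max());
        let correction = (running_max - new_max).exp();
        let weights = scores.sub_scalar(new_max).exp();
        running_sum = running_sum * correction + weights.sum();
        output.scale(correction);
        output.add_assign(&weights.matmul(&block.value));
        running_max = new_max;
    }
    output.scale(1.0 / running_sum);
    output
}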
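Block-wise attention only matches regular attention if the softmax is normalized over all keys, not per block. One common way to do that is an online softmax that tracks a running maximum and a running denominator. Here is a self-contained sketch on plain slices for a single query; the function signature, the `Vec<Vec<f32>>` layout and the 1/sqrt(dim) scaling are assumptions for illustration, not linfer's actual types.

```rust
/// Single-query attention over `keys` and `values` (one row per cached token),
/// processed `block_size` rows at a time with an online softmax.
pub fn streaming_attention(
    query: &[f32],
    keys: &[Vec<f32>],
    values: &[Vec<f32>],
    block_size: usize,
) -> Vec<f32> {
    assert!(block_size > 0);
    assert_eq!(keys.len(), values.len());
    let scale = 1.0 / (query.len() as f32).sqrt();
    let mut output = vec![0.0f32; values.first().map_or(0, |v| v.len())];
    let mut running_max = f32::NEG_INFINITY; // largest score seen so far
    let mut running_sum = 0.0f32; // softmax denominator so far

    for (k_block, v_block) in keys.chunks(block_size).zip(values.chunks(block_size)) {
        // Scaled dot product of the query with every key in this block.
        let scores: Vec<f32> = k_block
            .iter()
            .map(|k| scale * k.iter().zip(query).map(|(x, y)| x * y).sum::<f32>())
            .collect();
        let block_max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let new_max = running_max.max(block_max);

        // Rescale everything accumulated so far to the new maximum.
        let correction = (running_max - new_max).exp();
        running_sum *= correction;
        for o in output.iter_mut() {
            *o *= correction;
        }

        // Accumulate this block's unnormalized contribution.
        for (s, v) in scores.iter().zip(v_block) {
            let w = (s - new_max).exp();
            running_sum += w;
            for (o, x) in output.iter_mut().zip(v) {
                *o += w * x;
            }
        }
        running_max = new_max;
    }

    // Normalize once at the end.
    if running_sum > 0.0 {
        for o in output.iter_mut() {
            *o /= running_sum;
        }
    }
    output
}
```

Because the correction factor rescales earlier partial sums, the result does not depend on `block_size` beyond floating-point rounding, so the block size can be tuned purely for cache behavior.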
Benchmarks
Results for a 7B-parameter model (LLaMA-2 7B):
| Engine | Tokens/sec | Memory (GB) | CPU Usage |
|---|---|---|---|
| llama.cpp | 12.3 | 6.8 | 85% |
| PyTorch (CPU) | 8.1 | 14.2 | 92% |
| linfer | 36.7 | 4.1 | 78% |
Tested on an AMD Ryzen 9 5900X with 32 GB of RAM.
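As a quick sanity check on the headline figure, the speedups can be derived straight from the table. The snippet below only restates the numbers above; `RESULTS` and `speedup_over` are names made up for this example.

```rust
/// (engine, tokens/sec, memory in GB), copied from the benchmark table.
pub const RESULTS: [(&str, f64, f64); 3] = [
    ("llama.cpp", 12.3, 6.8),
    ("PyTorch (CPU)", 8.1, 14.2),
    ("linfer", 36.7, 4.1),
];

/// Throughput of linfer divided by the throughput of `baseline`.
pub fn speedup_over(baseline: &str) -> Option<f64> {
    let linfer = RESULTS.iter().find(|r| r.0 == "linfer")?;
    let base = RESULTS.iter().find(|r| r.0 == baseline)?;
    Some(linfer.1 / base.1)
}
```

That works out to roughly 2.98x over llama.cpp and 4.53x over PyTorch on CPU for this one model and machine.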
What’s Next?
Phase 2 is already in progress:
- Vulkan compute shaders for GPU acceleration
- Multi-model batching for serving multiple requests
- Dynamic quantization based on layer sensitivity
The goal isn’t to replace production inference servers — it’s to make local LLM experimentation fast and painless.
Try It Yourself
linfer is still experimental, but you can check out the code on GitHub. Fair warning: it’s rough around the edges and the API will change.
If you’re interested in systems-level ML optimization or just want to chat about SIMD intrinsics, hit me up.
Building tools that make AI more accessible, one AVX2 instruction at a time.