Why Build Another Inference Engine?

When I started working with large language models locally, I quickly ran into the usual suspects: slow inference times, memory bloat, and dependency hell. Most existing solutions are either too heavyweight (PyTorch with CUDA) or too opinionated about model formats.

So I built linfer, a Rust-based local LLM inference engine that runs about 3x faster than llama.cpp on CPU in my benchmarks (36.7 vs. 12.3 tokens/sec on LLaMA-2 7B).

The Core Problem

Running LLMs locally on CPU is painful:

  • Slow matrix operations — most libraries don’t leverage modern CPU instructions
  • Memory inefficiency — models are loaded in bloated formats
  • Poor portability — CUDA dependencies make deployment a nightmare

The linfer Approach

1. Custom Bundle Format (.lnf)

Instead of using standard model formats, linfer uses a custom .lnf bundle that:

  • Pre-quantizes weights to optimal bit-widths
  • Stores tensors in cache-friendly layouts
  • Includes metadata for zero-copy loading
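
To make the pre-quantization idea concrete, here is a minimal sketch of block-wise symmetric int8 quantization. The block size of 32, the `QuantBlock` layout, and the function names are assumptions for illustration; the actual .lnf encoding may differ.

```rust
/// Number of weights that share one scale factor (assumed value).
pub const BLOCK: usize = 32;

/// One quantized block: a shared scale plus signed 8-bit values.
/// Hypothetical layout, not the real .lnf on-disk format.
#[derive(Debug, Clone, PartialEq)]
pub struct QuantBlock {
    pub scale: f32,
    pub values: [i8; BLOCK],
}

/// Symmetric quantization: scale = max|x| / 127, q = round(x / scale).
pub fn quantize_block(input: &[f32]) -> QuantBlock {
    assert!(input.len() <= BLOCK);
    let max_abs = input.iter().fold(0.0f32, |m, x| m.max(x.abs()));
    // Avoid dividing by zero for an all-zero block.
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let mut values = [0i8; BLOCK];
    for (q, x) in values.iter_mut().zip(input) {
        *q = (x / scale).round().clamp(-127.0, 127.0) as i8;
    }
    QuantBlock { scale, values }
}

/// Recover approximate floats: x ≈ q * scale.
pub fn dequantize_block(block: &QuantBlock, out: &mut [f32]) {
    for (o, q) in out.iter_mut().zip(block.values.iter()) {
        *o = f32::from(*q) * block.scale;
    }
}
```

Each value is rounded to the nearest multiple of `scale`, so the round-trip error per weight is at most about half of `scale`. Storing one scale per small block keeps that error tied to the local weight range rather than the whole tensor.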
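
use std::fs::File;
use std::path::Path;

use memmap2::Mmap; // assumed crate; the original listing omits its imports

// `Result` is assumed to be a crate-level alias (for example anyhow::Result).
pub struct LnfBundle {
    header: BundleHeader,
    weights: Vec<QuantizedTensor>,
    config: ModelConfig,
}

impl LnfBundle {
    pub fn load_mmap(path: &Path) -> Result<Self> {
        let file = File::open(path)?;
        // SAFETY: the file must not be modified or truncated while it is mapped.
        let mmap = unsafe { Mmap::map(&file)? };
        // Note: `mmap` is dropped when this function returns, so `from_bytes`
        // has to copy whatever it keeps. For a truly zero-copy load the bundle
        // must own the `Mmap` so tensor data can keep borrowing from it.
        Self::from_bytes(&mmap)
    }
}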
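
As a rough sketch of what "zero-copy" means in practice, the parser below returns a view that borrows from the mapped bytes rather than copying them. The `LNF1` magic, the 8-byte header, and the names `BundleView` and `parse_bundle` are all hypothetical; the real .lnf header is not described here.

```rust
/// Hypothetical header: 4 magic bytes, then a little-endian u32 tensor count.
pub const MAGIC: &[u8; 4] = b"LNF1";
pub const HEADER_LEN: usize = 8;

/// A view over bundle bytes. `payload` points into the caller's buffer,
/// so nothing is copied and the buffer must outlive the view.
#[derive(Debug)]
pub struct BundleView<'a> {
    pub tensor_count: u32,
    pub payload: &'a [u8],
}

#[derive(Debug, PartialEq)]
pub enum ParseError {
    TooShort,
    BadMagic,
}

pub fn parse_bundle(bytes: &[u8]) -> Result<BundleView<'_>, ParseError> {
    if bytes.len() < HEADER_LEN {
        return Err(ParseError::TooShort);
    }
    if !bytes.starts_with(MAGIC) {
        return Err(ParseError::BadMagic);
    }
    let tensor_count = u32::from_le_bytes([bytes[4], bytes[5], bytes[6], bytes[7]]);
    Ok(BundleView {
        tensor_count,
        payload: &bytes[HEADER_LEN..],
    })
}
```

The lifetime on `BundleView` is what enforces the zero-copy contract: with a memory map as the backing buffer, whatever owns the map has to stay alive for as long as any view into it is in use.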

2. AVX2 SIMD Kernels

The real performance gains come from hand-written AVX2 kernels for critical operations:

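// `_mm256_fmadd_ps` is an FMA instruction, so `fma` must be enabled as well.
#[target_feature(enable = "avx2,fma")]
unsafe fn matmul_avx2(
    a: &[f32],     // m x k, row-major
    b: &[f32],     // k x n, row-major
    c: &mut [f32], // m x n, row-major
    m: usize,
    n: usize,
    k: usize,
) {
    use std::arch::x86_64::*;

    assert!(a.len() >= m * k && b.len() >= k * n && c.len() >= m * n);
    let n_vec = n - n % 8; // columns covered by full 8-wide vectors

    for i in 0..m {
        for j in (0..n_vec).step_by(8) {
            let mut acc = _mm256_setzero_ps();

            for p in 0..k {
                // Broadcast one element of A and multiply it into 8 columns of B
                let a_val = _mm256_set1_ps(a[i * k + p]);
                let b_vec = _mm256_loadu_ps(b.as_ptr().add(p * n + j));
                acc = _mm256_fmadd_ps(a_val, b_vec, acc);
            }

            _mm256_storeu_ps(c.as_mut_ptr().add(i * n + j), acc);
        }

        // Scalar tail so the vector loads never run past the end of a row
        for j in n_vec..n {
            let mut sum = 0.0f32;
            for p in 0..k {
                sum += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}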

Each fused multiply-add instruction processes 8 f32 lanes at once (8 multiply-adds), where scalar code handles one.
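
A kernel compiled with `#[target_feature]` is only safe to call on CPUs that actually support those features. The sketch below shows one common way to handle that: check at runtime with `is_x86_feature_detected!` and fall back to portable scalar code. The names `matmul`, `matmul_scalar`, and `matmul_avx2_fma` are illustrative, and this is not necessarily how linfer dispatches.

```rust
/// Portable reference: C = A (m x k) * B (k x n), all row-major.
pub fn matmul_scalar(a: &[f32], b: &[f32], c: &mut [f32], m: usize, n: usize, k: usize) {
    for i in 0..m {
        for j in 0..n {
            let mut sum = 0.0f32;
            for p in 0..k {
                sum += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}

/// AVX2 + FMA kernel. Callers must verify CPU support and slice lengths first.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn matmul_avx2_fma(a: &[f32], b: &[f32], c: &mut [f32], m: usize, n: usize, k: usize) {
    use std::arch::x86_64::*;
    let n_vec = n - n % 8; // columns handled 8 at a time
    for i in 0..m {
        for j in (0..n_vec).step_by(8) {
            let mut acc = _mm256_setzero_ps();
            for p in 0..k {
                let a_val = _mm256_set1_ps(a[i * k + p]);
                let b_vec = _mm256_loadu_ps(b.as_ptr().add(p * n + j));
                acc = _mm256_fmadd_ps(a_val, b_vec, acc);
            }
            _mm256_storeu_ps(c.as_mut_ptr().add(i * n + j), acc);
        }
        // Remaining columns use plain scalar arithmetic.
        for j in n_vec..n {
            let mut sum = 0.0f32;
            for p in 0..k {
                sum += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}

/// Safe entry point: picks the fastest kernel the current CPU supports.
pub fn matmul(a: &[f32], b: &[f32], c: &mut [f32], m: usize, n: usize, k: usize) {
    assert!(a.len() >= m * k && b.len() >= k * n && c.len() >= m * n);
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            // SAFETY: features were just detected and lengths were checked above.
            unsafe { matmul_avx2_fma(a, b, c, m, n, k) };
            return;
        }
    }
    matmul_scalar(a, b, c, m, n, k);
}
```

Keeping the `unsafe` call behind a safe wrapper means the rest of the engine never has to reason about CPU features, and the scalar path doubles as a reference when testing the SIMD kernel.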

3. Memory-Efficient Attention

Naive attention materializes the full score matrix as a temporary buffer, and that buffer grows with sequence length. linfer streams over the KV cache in blocks instead:

pub fn streaming_attention(
    query: &Tensor,
    key_cache: &mut KeyValueCache,
    block_size: usize,
) -> Tensor {
    let mut output = Tensor::zeros(query.shape());

    // Process in blocks to stay in L2 cache
    for block in key_cache.blocks(block_size) {
        let scores = query.matmul(&block.key);
        // Simplified: a per-block softmax only equals full-softmax attention
        // if each block's contribution is rescaled by a running max and sum.
        let weights = softmax(&scores);
        output.add_assign(&weights.matmul(&block.value));
    }

    output
}
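
One detail the simplified listing above glosses over: summing independently softmaxed blocks does not, by itself, reproduce full-softmax attention. The usual fix is the online-softmax recurrence, which tracks a running maximum and normalizer across blocks. Here is a self-contained sketch for a single query vector over plain slices. The function name, the slice-based signature, and the 1/sqrt(dim) scaling are assumptions of this sketch, not linfer's API.

```rust
/// Single-query attention computed block by block.
/// `keys` and `values` are row-major [seq_len x dim].
pub fn online_softmax_attention(
    query: &[f32],
    keys: &[f32],
    values: &[f32],
    dim: usize,
    block_size: usize,
) -> Vec<f32> {
    assert!(dim > 0 && block_size > 0);
    assert_eq!(query.len(), dim);
    assert_eq!(keys.len(), values.len());
    assert_eq!(keys.len() % dim, 0);

    let scale = 1.0 / (dim as f32).sqrt(); // standard scaled dot-product
    let mut running_max = f32::NEG_INFINITY;
    let mut running_sum = 0.0f32;
    let mut acc = vec![0.0f32; dim];

    let stride = block_size * dim;
    for (k_block, v_block) in keys.chunks(stride).zip(values.chunks(stride)) {
        // Scores for this block only: one scaled dot product per key row.
        let scores: Vec<f32> = k_block
            .chunks(dim)
            .map(|k| scale * k.iter().zip(query).map(|(a, b)| a * b).sum::<f32>())
            .collect();

        let block_max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let new_max = running_max.max(block_max);

        // Rescale everything accumulated so far to the new maximum.
        let correction = (running_max - new_max).exp();
        running_sum *= correction;
        for o in acc.iter_mut() {
            *o *= correction;
        }

        // Add this block's unnormalized contribution.
        for (s, v) in scores.iter().zip(v_block.chunks(dim)) {
            let w = (s - new_max).exp();
            running_sum += w;
            for (o, x) in acc.iter_mut().zip(v) {
                *o += w * x;
            }
        }
        running_max = new_max;
    }

    // Normalize once at the end.
    if running_sum > 0.0 {
        for o in acc.iter_mut() {
            *o /= running_sum;
        }
    }
    acc
}
```

Only one block of scores lives in memory at a time, and the result should match what a single pass over the whole sequence would give, up to floating-point rounding, regardless of `block_size`.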

Benchmarks

Testing on a 7B parameter model (LLaMA-2 7B):

Engine          Tokens/sec   Memory (GB)   CPU Usage
llama.cpp       12.3         6.8           85%
PyTorch (CPU)   8.1          14.2          92%
linfer          36.7         4.1           78%

Tested on AMD Ryzen 9 5900X, 32GB RAM

What’s Next?

Phase 2 is already in progress:

  • Vulkan compute shaders for GPU acceleration
  • Multi-model batching for serving multiple requests
  • Dynamic quantization based on layer sensitivity

The goal isn’t to replace production inference servers — it’s to make local LLM experimentation fast and painless.

Try It Yourself

linfer is still experimental, but you can check out the code on GitHub. Fair warning: it’s rough around the edges and the API will change.

If you’re interested in systems-level ML optimization or just want to chat about SIMD intrinsics, hit me up.


Building tools that make AI more accessible, one AVX2 instruction at a time.