
Sovereign AI on a Desktop, Part 3: The Autoresearcher

Mihai Chiorean | March 2026

Series: Sovereign AI on a Desktop

  • Part 1: The Stack -- What I'm running and why
  • Part 2: Five Bugs in NVIDIA's Code -- Fixing TensorRT-LLM for DGX Spark
  • Part 3: The Autoresearcher -- An AI agent that optimizes its own inference (you are here)
  • Part 4: 100K Context -- KV cache compression via TurboQuant
  • Part 5: The Bandwidth Wall -- What actually limits a $3,000 desktop


I spent two days manually tuning llama.cpp parameters. Change one knob, restart the server (10 minutes to load 95GB from NVMe), run a test, write down the number. By experiment six I was making mistakes -- testing the wrong setting, forgetting to reset the context size, accidentally running against the wrong model. So I automated myself out of the loop.

The autoresearcher is a Python framework that runs experiments while I sleep. It found something I would have missed: q4_0 KV cache quantization -- aggressive 4-bit -- causes zero quality loss on MiniMax M2.5. None. This was not expected, and it changed everything.

The Problem

A 229B MoE model on 128GB unified memory with 273 GB/s bandwidth has dozens of configuration knobs, and the interactions between them are non-obvious:

| Parameter | Range | Why it matters |
|---|---|---|
| n_gpu_layers | 0-62 | How much of the 62-layer MoE lives on GPU vs CPU |
| ctx_size | 2K-96K | Context window -- more context means more KV cache memory |
| cache_type_k / cache_type_v | f16, q8_0, q4_0 | KV cache quantization -- the biggest memory lever |
| flash_attn | on / off | Flash attention for memory-efficient attention |
| n_threads | 1-20 | CPU threads (Spark has 20 ARM cores) |
| n_draft | 0-16 | Speculative decoding (M2.5 has built-in MTP) |

Plus batch size, NUMA strategy, memory mapping, memory locking, and more. Manual exploration of this space is error-prone and slow. Each experiment takes 15-20 minutes, dominated by the 10+ minute model load time. A thorough sweep would take days by hand, and I would make mistakes.

How It Works

The autoresearcher (autoresearch/optimize.py) runs a loop:

  1. Start a llama-server with a candidate configuration
  2. Wait for health -- poll /health until the model finishes loading (10+ minutes for 95GB via mmap)
  3. Run quality checks -- known-answer prompts with keyword scoring. "What happens when you split an atom?" must contain "fission" or "fusion." "Write a Fibonacci function" must contain "memo" or "cache." Simple but effective sanity checks.
  4. Run speed checks -- measure decode tok/s, prefill tok/s, time to first token
  5. Test context windows -- fill context to the target size, then check if the model can still recall a fact planted at the beginning
  6. Record everything to JSONL -- full config, all timings, quality scores, peak memory via /proc/meminfo
  7. Kill the server and start the next experiment
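
The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the real optimize.py: the prompts and keywords come from step 3, but the function names, JSONL fields, and health-poll parameters are hypothetical.

```python
import json
import time
import urllib.request
from typing import Callable

# Known-answer prompts with keyword scoring (step 3). Prompts and keywords
# are from the article; everything else here is a hypothetical sketch.
QUALITY_CHECKS = [
    ("What happens when you split an atom?", ["fission", "fusion"]),
    ("Write a Fibonacci function.", ["memo", "cache"]),
]

def wait_for_health(url: str = "http://localhost:8001/health",
                    timeout_s: int = 1800) -> bool:
    """Step 2: poll /health until the model finishes loading (10+ minutes)."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as r:
                if r.status == 200:
                    return True
        except OSError:
            pass  # server not up yet; keep polling
        time.sleep(10)
    return False

def score_quality(generate: Callable[[str], str]) -> float:
    """Step 3: fraction of known-answer prompts whose reply hits a keyword."""
    passed = sum(
        any(kw in generate(prompt).lower() for kw in keywords)
        for prompt, keywords in QUALITY_CHECKS
    )
    return passed / len(QUALITY_CHECKS)

def record_experiment(config: dict, quality: float,
                      log_path: str = "experiments.jsonl") -> None:
    """Step 6: append the full config and scores as one JSONL record."""
    with open(log_path, "a") as f:
        f.write(json.dumps({"config": config, "quality": quality}) + "\n")
```

The `generate` callable would wrap a request to the running llama-server; injecting it as a parameter is what makes the scoring testable without a 95GB model load.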

The search strategy has two phases. Phase 1 explores each axis independently, starting from a known-good default config. For n_gpu_layers, it tests 0, 31, and 62. For ctx_size, it tests 2K, 16K, and 32K. For KV cache types, it tests f16, q8_0, and q4_0. Three values per axis across the axes above works out to roughly 20 experiments, since many axis values coincide with the default.

Phase 2 takes the best configuration from Phase 1 and does targeted perturbations -- small variations around the optimum to check for local improvements.
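The two phases can be sketched as config generators. The defaults and axis values mirror the ones described above; the names (DEFAULTS, phase1_configs, phase2_configs) are mine, not the real optimize.py API.

```python
# Hypothetical sketch of the two-phase search over llama-server configs.
DEFAULTS = {
    "n_gpu_layers": 999,
    "ctx_size": 16384,
    "cache_type_k": "f16",
    "cache_type_v": "f16",
    "flash_attn": "on",
}

AXES = {  # three values per axis, as in Phase 1
    "n_gpu_layers": [0, 31, 62],
    "ctx_size": [2048, 16384, 32768],
    "cache_type_k": ["f16", "q8_0", "q4_0"],
}

def phase1_configs():
    """Phase 1: vary one axis at a time from the known-good default."""
    for axis, values in AXES.items():
        for value in values:
            yield {**DEFAULTS, axis: value}

def phase2_configs(best: dict, perturbations: dict):
    """Phase 2: small perturbations around the Phase 1 optimum."""
    for axis, values in perturbations.items():
        for value in values:
            if best.get(axis) != value:  # skip values already tested
                yield {**best, axis: value}
```

One-axis-at-a-time misses interaction effects by design, which is exactly what Phase 2's perturbations around the optimum are there to catch.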

For teams with an LLM API key, there is also an agent.py mode that uses Claude to analyze the experiment history and propose the next configuration -- literally an AI optimizing its own inference stack. Without a key, it falls back to the systematic grid search.

The Key Finding: q4_0 KV Is Lossless on M2.5

The most important result from 15 experiments:

KV cache quantization to q4_0 causes zero quality degradation on MiniMax M2.5.

This was not obvious. q4_0 is aggressive quantization -- 4 bits per element plus a block scale, averaging 4.5 bits per element. On many models, q4_0 KV cache causes measurable accuracy loss. On M2.5, every quality prompt produced identical answers. Every context recall test passed. The autoresearcher scored q4_0 KV identically to f16 KV across all benchmarks.

Why? M2.5's GQA architecture uses 48 attention heads but only 8 KV heads. Each KV head is shared across 6 attention heads. The KV representations are already highly compressed by the GQA structure -- they encode broad, redundant information rather than fine-grained per-head detail. Quantizing this already-redundant signal to 4-bit loses nothing that the attention mechanism can detect.

This single finding unlocked 65K context in 128GB. Without q4_0 KV cache, the maximum context at f16 precision is about 16K before OOM. With q4_0, 65K fits alongside the 95GB model weights with thin margins. That is a 4x increase in usable context from a configuration change that costs zero quality.
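The memory lever is easy to quantify. The sketch below uses the article's figures (62 layers, 8 KV heads); head_dim=128 is an assumption for illustration, not a confirmed M2.5 spec, and real context limits also depend on compute buffers. The f16-to-q4_0 ratio of ~3.6x, however, is independent of head_dim and lines up with the ~4x context gain.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int,
                       head_dim: int, bits: float) -> float:
    """K and V caches together: two tensors per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bits / 8

# From the article: 62 layers, 8 KV heads (GQA).
# head_dim=128 is an ASSUMPTION for illustration only.
F16_BITS, Q4_0_BITS = 16, 4.5  # q4_0 averages 4.5 bits with block scales

f16_tok = kv_bytes_per_token(62, 8, 128, F16_BITS)
q4_tok = kv_bytes_per_token(62, 8, 128, Q4_0_BITS)
print(round(f16_tok / q4_tok, 2))  # 3.56 -- consistent with ~16K -> 65K
```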

The 9 tok/s Anomaly

Not everything the autoresearcher found was good news.

One experiment tested mixed KV cache: q8_0 for keys, q4_0 for values. In theory, this should give better quality than q4_0/q4_0 (higher-precision keys preserve attention routing) at slightly more memory than q4_0/q4_0 but less than q8_0/q8_0. A reasonable hypothesis.

The result: 9 tok/s decode. Nearly three times slower than the 24 tok/s baseline. And the context recall tests failed at 16K tokens.

I investigated. The issue was in llama.cpp's flash attention dispatch. When K and V have different quantization types, the kernel takes a different code path that, on SM121, falls back to a scalar implementation instead of the vectorized one. The performance cliff is catastrophic -- not a gradual degradation but a discrete fallback to a fundamentally slower kernel.

This is not a llama.cpp bug per se -- mixed KV types are a less-tested configuration, and the FA kernel dispatch prioritizes correctness over performance for edge cases. But it means mixed KV is not viable on the Spark. Either quantize K and V the same, or use the TurboQuant asymmetric approach (Part 4) which handles the dispatch correctly at the kernel level.

The autoresearcher caught this in experiment 8 of 15. By hand, I might have tried mixed KV once, seen the slow speed, blamed something else, and moved on without understanding why.

The Build Flag Discovery

Another finding that I would have missed manually: the GGML_CUDA_FA_ALL_QUANTS build flag.

By default, llama.cpp only compiles flash attention kernels for a subset of KV quantization types (f16, q8_0, q4_0). If you want FA with other types -- like the TurboQuant types that come later in this series -- you need to build with -DGGML_CUDA_FA_ALL_QUANTS=ON. Without this flag, any non-standard KV type silently falls back to a non-FA attention path that is slower and uses more memory.

I discovered this when an autoresearcher experiment with a custom KV type showed unexpectedly high memory usage. The memory pattern did not match flash attention behavior. Checking the build configuration revealed the missing flag. After rebuilding with it enabled, memory usage dropped to the expected FA level and performance improved.

This is the kind of thing that wastes hours when tuning by hand because the symptom (slightly higher memory, slightly lower speed) does not obviously point to a build configuration issue. The autoresearcher's systematic logging of memory usage per experiment made the pattern visible.
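That per-experiment logging makes this class of anomaly mechanically detectable. A minimal sketch of the idea, assuming hypothetical JSONL field names (peak_mem_gb) rather than the real log schema:

```python
import json

def flag_memory_outliers(jsonl_path: str, ratio: float = 1.2) -> list:
    """Return experiments whose peak memory exceeds the median by `ratio`.

    Assumes each JSONL record has hypothetical 'config' and
    'peak_mem_gb' fields.
    """
    with open(jsonl_path) as f:
        records = [json.loads(line) for line in f]
    mems = sorted(r["peak_mem_gb"] for r in records)
    median = mems[len(mems) // 2]
    return [r for r in records if r["peak_mem_gb"] > ratio * median]
```

Comparing each run against the median of its peers is what surfaced the missing-flag pattern: a non-FA fallback shows up as a consistent memory bump across otherwise similar configs.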

Other Findings

The autoresearcher confirmed several things that are obvious in retrospect but worth verifying on this specific hardware:

  • All 62 layers on GPU (n_gpu_layers 999) is optimal. On unified memory, there is no real CPU/GPU split. Putting layers "on CPU" just means they go through a different (slower) compute path while accessing the same physical memory. There is zero benefit to CPU offloading on the Spark.

  • --flash-attn on is required. This specific build requires the explicit on value, not just the bare --flash-attn flag. Without flash attention, the KV cache memory grows linearly with context instead of being managed in-place, and you OOM at ~16K context.

  • Performance is entirely memory-bandwidth-bound for decode. No build optimization, thread count change, or batch size adjustment moved decode throughput beyond ~24 tok/s. The ceiling is memory bandwidth divided by the bytes read per decoded token -- for an MoE, the active experts plus KV cache rather than the full 95GB -- and no configuration change can break through it. I will analyze this in detail in Part 5.

  • mmap vs preload does not affect steady-state performance. mmap causes the first inference to be slow (page faults as the OS loads data from NVMe), but after the model is fully resident in memory, throughput is identical. mmap does make the server start accepting requests faster (it does not wait for the full 95GB to load), which matters for the systemd health gate.

  • Speculative decoding (MTP) provides no benefit. M2.5 has a built-in multi-token prediction head. I tested n_draft from 1 to 16. No improvement. On bandwidth-bound hardware, speculative decoding generates candidate tokens that still require bandwidth to verify, and the verification cost eats any savings from parallelism. More on this in Part 5.
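
The bandwidth bullet reduces to a back-of-envelope using only the numbers from this article. Inverting tok/s ≈ bandwidth / bytes-per-token gives the memory traffic each decoded token implies:

```python
BANDWIDTH_GBPS = 273.0  # Spark memory bandwidth (GB/s)
DECODE_TPS = 24.0       # measured decode throughput (tok/s)

# Decode is bandwidth-bound: tok/s ~= bandwidth / bytes read per token.
# Inverting gives the implied traffic per decoded token:
implied_gb_per_token = BANDWIDTH_GBPS / DECODE_TPS
print(implied_gb_per_token)  # 11.375 -- roughly 11 GB of weights + KV per token
```

Reading ~11GB per token out of a 95GB model is consistent with an MoE touching only its active experts, which is why no configuration knob could move the number.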

The Production Configuration

Fifteen experiments, five hours, one production configuration:

llama-server \
  --model MiniMax-M2.5-UD-Q3_K_XL.gguf \
  --host 0.0.0.0 --port 8001 \
  --ctx-size 65536 \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  --flash-attn on \
  --n-gpu-layers 999 \
  --threads 10

This was the baseline configuration that ran my nanobot agent for weeks: 24 tok/s decode, 65K context, q4_0 KV at zero quality cost. The systemd service file wraps this with restart-on-failure and the health gate.

The configuration later evolved. TurboQuant KV compression (Part 4) replaced q4_0 KV with tbq4_0 keys and tbq3_0 values, pushing context from 65K to 96K while actually improving quality. But the autoresearcher's q4_0 finding was the foundation -- it proved that KV compression was the right lever, and that M2.5's architecture is unusually tolerant of it.

What the Autoresearcher Taught Me

The value of automation is not just speed -- it is consistency. The autoresearcher tests every configuration with the same prompts, the same measurement code, the same memory tracking. It does not forget to reset a parameter. It does not accidentally test against the wrong endpoint. It logs everything.

The three findings that mattered:

  1. q4_0 KV is lossless on M2.5 (unlocked 65K context)
  2. Mixed KV types are broken (the 9 tok/s anomaly, which saved me from a production misconfiguration)
  3. The bandwidth wall is absolute (no configuration escapes 273 GB/s)

Finding #1 was the insight that enabled everything that followed. If q4_0 KV had degraded quality on M2.5, I would have been stuck at 16K context with f16 KV, and the TurboQuant work in Part 4 would have had a different starting point.

Finding #3 set the right expectations. There is no magic configuration that makes a 273 GB/s machine decode faster than the bandwidth allows. Once I accepted that, I stopped chasing throughput and focused on what I could improve: context length and quality.


Next: Part 4: 100K Context -- 65K was not enough. I needed more context without buying more hardware. The TurboQuant paper promised 3-6x KV compression. Getting it to work required fixing seven bugs, testing on the wrong model with the wrong corpus, and surviving a thirteen-million perplexity disaster.


Mihai Chiorean is a software engineer in San Francisco. Previously CTO at Wendy Labs (edge OS on Yocto/Jetson), EM at Cash App (compliance rules engine, $100B+ txn volume), and engineer at Uber, Block/TBD, and InVision. He builds sovereign AI systems on NVIDIA hardware and contributes to TensorRT-LLM and NemoClaw.