
Sovereign AI on a Desktop: Running 229B Parameters Without the Cloud

A blog series by Mihai Chiorean | March 2026

Series Overview

This is a multi-part series about running a 229-billion parameter model on a $3,000 desktop — no cloud, no API keys, no monthly bill. Each post covers a different piece of the puzzle.

Part 1: The Stack — What I'm running and why
Part 2: Five Bugs in NVIDIA's Code — Fixing TensorRT-LLM for DGX Spark
Part 3: The Autoresearcher — An AI agent that optimizes its own inference
Part 4: The Benchmark — llama.cpp Q3 vs TRT-LLM NVFP4 (coming when PRs land)
Part 5: 100K Context — KV cache compression via TurboQuant


Part 1: The Stack

I run a 229-billion parameter model on my desk. Not in the cloud. Not on a $30,000 server. On a DGX Spark — NVIDIA's $3,000 Grace Blackwell desktop with 128GB of unified memory and a GB10 GPU.

The model is MiniMax M2.5, a Mixture-of-Experts architecture with 256 experts and 10 billion active parameters per token. It powers my personal AI agent — a sovereign, self-hosted assistant called nanobot that connects to Discord, Telegram, and Slack, runs shell commands, searches the web, and manages my schedule. No cloud API keys. No per-token billing. No data leaving my machine.

Getting here required fixing two bugs in NVIDIA's own inference stack that nobody at NVIDIA had caught — because nobody had tried to run a 229B MoE model on this hardware before.

The Setup: Sovereign AI on a Desktop

The DGX Spark is a strange machine. It has a Grace ARM CPU and a Blackwell GB10 GPU sharing a single 128GB LPDDR5x memory pool via NVLink-C2C. There's no separate VRAM — CPU and GPU see the same memory. This means you can load models that would never fit in a traditional GPU's memory, but you're bottlenecked by 273 GB/s memory bandwidth instead of the 2+ TB/s you'd get on a datacenter B200.

I chose MiniMax M2.5 because it's the best open model at its size class — outperforming many 70B dense models while only activating 10B parameters per token (the MoE advantage). At Q3_K_XL quantization, it fits in ~95GB, leaving room for a 64K-token KV cache.
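A quick sanity check on that claim, using my own round-number arithmetic in decimal gigabytes (not figures from any spec sheet):

```python
params = 229e9          # total parameters, all experts
model_bytes = 95e9      # the ~95 GB Q3_K_XL footprint

bpw = model_bytes * 8 / params          # implied average bits per weight
headroom_gb = 128 - 95                  # left for KV cache, activations, OS
print(f"{bpw:.2f} bpw, {headroom_gb} GB headroom")   # 3.32 bpw, 33 GB headroom
```

About 3.3 bits per weight is consistent with a 3-bit K-quant whose scale factors and higher-precision shared tensors pull the average up.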

The inference stack: llama.cpp serving via llama-server, running 24/7 as a systemd service. My nanobot agent connects to it on localhost:8001 and routes Discord messages through it. It works. 24 tokens/second decode, 64K context, completely private.

But I wanted more. TensorRT-LLM promised native FP4 tensor core support — Blackwell's hardware-accelerated 4-bit format that could mean faster inference and better quality than software quantization. So I tried to build TRT-LLM for the Spark.

It didn't work.

PR #12141: The Shared Memory Overflow Nobody Saw

[Merged] github.com/NVIDIA/TensorRT-LLM/pull/12141

When I tried to run TensorRT-LLM's FP4 GEMM kernels on the DGX Spark, I got a cryptic kErrorInternal error. No useful error message. Just a silent failure deep in the CUTLASS kernel dispatch.

After digging through NVIDIA's CUTLASS code, I found the problem: the FP4 GEMM tile configurations for SM120 (the Blackwell architecture family that includes RTX 5090 and DGX Spark) were copied from SM100 (the datacenter B200/B100) without accounting for a critical hardware difference.

SM100 (B200): ~227 KiB shared memory per block.
SM12x (RTX 5090, DGX Spark): ~99 KiB shared memory per block.

The tile sizes 128x128x256B and 256x128x128B were designed for SM100's generous shared memory budget. On SM12x, CUTLASS's StageCountAutoCarveout would compute pipeline stage sizes at compile time using the architecture's SharedMemoryCapacity, and the resulting SharedStorage struct simply didn't fit. This is a compile-time structural failure — you can't fix it at runtime by tuning parameters.
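To see the budget pressure, a back-of-envelope model helps. The numbers below are my own simplification, not CUTLASS's exact accounting: NVFP4 stores 4 bits per element plus one FP8 scale per 16 elements, and I assume a 4-stage software pipeline while ignoring epilogue carveout and barrier storage.

```python
# Back-of-envelope shared memory for an FP4 GEMM mainloop.
# Illustrative only: real CUTLASS accounting also includes epilogue
# carveout, mbarriers, and alignment padding.
def stage_bytes(m, n, k):
    elems = (m + n) * k                # A tile (m x k) + B tile (n x k)
    return elems // 2 + elems // 16    # 4-bit data + one FP8 scale per 16 elems

SM100_SMEM = 227 * 1024    # ~227 KiB per block (B200)
SM12X_SMEM = 99 * 1024     # ~99 KiB per block (RTX 5090, DGX Spark)
STAGES = 4                 # assumed pipeline depth

for tile in [(128, 128, 256), (256, 128, 128), (128, 128, 128)]:
    need = STAGES * stage_bytes(*tile)
    print(f"{tile}: {need // 1024:3d} KiB  "
          f"SM100={need <= SM100_SMEM}  SM12x={need <= SM12X_SMEM}")
```

Under these assumptions the two SM100-era tiles need roughly 144 KiB and 108 KiB, comfortably inside 227 KiB but past the 99 KiB SM12x budget, while 128x128x128B needs about 72 KiB and fits on both.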

The fix was to add a CtaShape128x128x128B tile configuration that fits within SM12x's 99 KiB limit. This tile was already compiled in the codebase (for bf16, fp16, and fp32 output types) but was never wired into the SM120 dispatch path. Three changes:

  1. Added the 128x128x128B dispatch case to dispatchNVFP4xNVFP4GemmCTAShapeSm120
  2. Made it the default tile config for SM120+ (the autotuner still profiles all candidates)
  3. Left the SM100 path completely untouched — datacenter GPUs use their own tile configs

An NVIDIA engineer (@eugr) caught a comment error in my first submission — I'd incorrectly labeled which arch had which SMEM size. Fixed in the same PR. NVIDIA's CI ran the full test suite twice; it passed, and the PR was merged on March 18.

This unblocks FP4 GEMM for every SM12x device: RTX 5090, RTX 5080, and DGX Spark.

PR #12301: The Unnecessary Memcpy That Wasted Bandwidth

[Open] github.com/NVIDIA/TensorRT-LLM/pull/12301

The second issue was more subtle. TensorRT-LLM's KV cache manager has a tiered memory system: a primary pool (GPU memory) and a secondary pool (CPU memory). When context gets long, it "offloads" KV cache blocks from GPU to CPU, and "onboards" them back when needed. On a datacenter GPU with separate VRAM and system RAM, this makes sense — you're moving data between fast and slow memory.

On the DGX Spark, it's absurd. CPU and GPU share the same 128GB LPDDR5x pool. There is no "GPU memory" vs "CPU memory" — it's all the same physical memory accessible via the same memory controller at the same bandwidth. The offload/onboard memcpy operations are copying data from a physical address to... the same physical address. Zero benefit, wasted bandwidth, added latency.

This matters because bandwidth is THE bottleneck on the Spark. At 273 GB/s, every unnecessary memory operation directly competes with the actual inference computation. On a bandwidth-limited device, wasting cycles on no-op copies is the difference between 24 tok/s and something better.

My fix adds runtime detection of unified memory using cudaDevAttrPageableMemoryAccess and three optimizations:

  1. Skip offload/onboard memcpy when unified memory is detected. Block metadata bookkeeping still happens — we just don't do the pointless copy.
  2. Fold secondary cache into primary pool. Instead of two artificial tiers backed by the same memory, give the block manager one large pool. This eliminates the eviction/promotion overhead entirely.
  3. Allocate secondary blocks as GPU memory. On unified memory, use BufferManager::managed instead of BufferManager::pinned, avoiding page-locking overhead.

All three optimizations are gated by a single runtime check. On discrete GPU systems, cudaDevAttrPageableMemoryAccess returns false and the code path is never entered. Zero risk of regression for existing hardware.
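The gating pattern, reduced to a toy model (hypothetical class and method names, not TRT-LLM's real C++ block manager; in the actual change the flag comes from querying cudaDevAttrPageableMemoryAccess at startup):

```python
# Toy sketch of unified-memory gating in a tiered KV block manager.
class KVBlockManager:
    def __init__(self, unified_memory: bool):
        self.unified = unified_memory
        self.copies = 0          # physical memcpys issued
        self.location = {}       # block id -> "primary" | "secondary"

    def offload(self, block_id):
        self.location[block_id] = "secondary"   # bookkeeping always happens
        if not self.unified:
            self.copies += 1                    # discrete GPU: real D2H copy

    def onboard(self, block_id):
        self.location[block_id] = "primary"
        if not self.unified:
            self.copies += 1                    # discrete GPU: real H2D copy

spark = KVBlockManager(unified_memory=True)    # DGX Spark
b200 = KVBlockManager(unified_memory=False)    # discrete datacenter GPU
for mgr in (spark, b200):
    mgr.offload(0)
    mgr.onboard(0)
print(spark.copies, b200.copies)   # 0 2: same bookkeeping, no phantom copies
```

The point of the sketch is that block metadata moves identically on both paths; only the physical copy is skipped.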

This PR is still open, awaiting NVIDIA review.

Why This Matters: The Case for Sovereign AI

These aren't academic fixes. They're the difference between "TensorRT-LLM doesn't work on Spark" and "I can run a 229B model with native FP4 at full tensor core speed on my desk."

The broader point is about sovereign AI — the idea that you should be able to run capable AI models on hardware you own, without depending on cloud APIs. My nanobot agent currently runs on llama.cpp at 24 tok/s. If TensorRT-LLM's FP4 path works correctly on Spark (now that the GEMM overflow is fixed), and the KV cache doesn't waste bandwidth on phantom copies, the performance ceiling goes up significantly.

Here's what the stack looks like today:

nanobot (Python, 4K lines)
  ├── Discord / Telegram / Slack channels
  ├── MCP tool integration
  ├── ChromaDB RAG for memory
  ├── Multi-LLM routing
  └── localhost:8001 → llama-server
                          └── MiniMax M2.5 (229B MoE, Q3_K_XL)
                               └── DGX Spark GB10 (128GB unified, SM121)

Every layer is self-hosted. The model weights are on my NVMe. The inference runs on my GPU. The agent logic is my Python. The conversations stay on my machine. And when two bugs in NVIDIA's stack prevented the next performance leap, I read the CUTLASS source, understood the hardware constraints, and fixed them.

That's what sovereign AI means in practice: not just running someone else's stack, but being able to fix it when it breaks.

Three More Walls

Fixing the FP4 GEMM and KV cache got me past the first two barriers. Then I tried to actually run MiniMax M2.5 end-to-end through TRT-LLM on the Spark. Three more walls.

PR #12309: The Python Gate That Blocked All MoE Models

[Open] github.com/NVIDIA/TensorRT-LLM/pull/12309

The CUTLASS FP4 kernels worked (thanks to PR #12141). But the Python dispatch layer above them still threw NotImplementedError for SM120/SM121. The MoE routing code in fused_moe_trtllm_gen.py had an explicit gate: only SM100 and SM103 (datacenter Blackwell) were allowed. The consumer/desktop Blackwell chips were simply rejected before the code ever reached the working kernels.

This wasn't a hardware limitation — it was an allowlist that nobody had updated. The fix extends can_implement() across four files to include SM120/SM121, and routes them to the TRTLLM MoE backend instead of letting them fall through to an incompatible CUTLASS path. This unblocks MiniMax M2.5, DeepSeek, Qwen, and every other MoE architecture on consumer Blackwell.
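The shape of the fix, reduced to its essence (illustrative Python, not the actual four-file diff):

```python
# Before: only datacenter Blackwell passed the gate.
SUPPORTED_SM_BEFORE = {100, 103}
# After: consumer/desktop Blackwell is allowed through too.
SUPPORTED_SM_AFTER = SUPPORTED_SM_BEFORE | {120, 121}

def can_implement(sm_version, supported):
    # Simplified stand-in for the real can_implement() checks.
    return sm_version in supported

assert not can_implement(121, SUPPORTED_SM_BEFORE)  # old NotImplementedError path
assert can_implement(121, SUPPORTED_SM_AFTER)       # now routed to the MoE backend
```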

PR #12310: The Autotuner Crash

[Open] github.com/NVIDIA/TensorRT-LLM/pull/12310

With the MoE gate fixed, TRT-LLM got further — into the autotuner warmup phase, where it immediately crashed with IndexError: list assignment index out of range. The autotuner's _find_nearest_profile() assumes that every operation produces tensors with the same shape dimensions as the profiling spec expects. On SM121, some ops produce fewer or differently-shaped tensors, and the indexing goes out of bounds.

This is the kind of bug that only surfaces on hardware the test suite doesn't cover. The fix adds bounds checks before indexing into the base profile — out-of-bounds specs are skipped with a debug log, in-bounds behavior is unchanged. Eight test cases covering normal, out-of-range, and mixed scenarios.
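A minimal sketch of the bounds-check pattern (hypothetical names and data layout; the real method is the autotuner's _find_nearest_profile):

```python
def nearest_profile(base_profile, shape_specs, observed_dims):
    # shape_specs: (index into profile, index into observed_dims) pairs.
    profile = list(base_profile)
    for spec_idx, dim in shape_specs:
        if spec_idx >= len(profile):
            # Some SM121 ops produce fewer tensors than the spec expects;
            # skip out-of-range specs instead of crashing (the real fix
            # logs this at debug level).
            continue
        profile[spec_idx] = observed_dims[dim]
    return profile

# Before the fix, the (5, 1) spec raised IndexError on a 3-entry profile.
print(nearest_profile([1, 8, 128], [(1, 0), (5, 1)], [16, 64]))  # [1, 16, 128]
```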

PR #12311: The Wrong Attention Kernels

[Open] github.com/NVIDIA/TensorRT-LLM/pull/12311

Past the autotuner, the next crash: the attention backend. TRT-LLM's trtllm-gen FMHA (Flash Multi-Head Attention) runner only has compiled cubins for SM100/SM103. SM120 uses different ISA instructions (mma.sync.aligned.block_scale vs tcgen05.mma), so SM100 cubins can't just be reused — they're fundamentally incompatible at the instruction set level.

The fix routes SM120/SM121 to the FMHA v2 fallback, which does work on SM12x. It's not the fastest path (trtllm-gen FMHA is optimized for SM100), but it runs correctly. SM100/SM103 keep using trtllm-gen FMHA with zero regression. The patch also adds clear warning messages instead of cryptic assertion failures, so the next person hitting this on a 5090 or Spark doesn't spend hours in a debugger.

The Pattern

Five PRs. Same root cause every time: TensorRT-LLM was built and tested on datacenter Blackwell (SM100/SM103). Consumer and desktop Blackwell (SM120/SM121) shares the architecture name but has different shared memory sizes, different ISA instructions, and different resource limits. Nobody at NVIDIA had run the full MoE inference pipeline on a Spark or RTX 5090.

PR      Layer         Problem                                          Status
#12141  CUTLASS GEMM  Tile configs exceed SM12x shared memory          Merged
#12301  KV cache      No-op memcpy wastes bandwidth on unified memory  Open
#12309  MoE dispatch  Python allowlist blocks SM120/SM121              Open
#12310  Autotuner     Index out of bounds on SM121 tensor shapes       Open
#12311  Attention     SM100 cubins incompatible with SM120 ISA         Open

Each fix is small — the largest is ~100 lines. But they span the entire stack from CUDA kernels to Python dispatch to the serving runtime. You can't just fix one layer. You have to understand the full pipeline and where SM12x diverges from SM100 at each level.

What's Next

Once all five PRs land, the path is clear: convert MiniMax M2.5 to NVFP4 format (~114 GB) and run it through TRT-LLM on the Spark with native FP4 tensor core compute and an optimized unified-memory KV cache. The before/after benchmark — llama.cpp Q3_K_XL vs TRT-LLM NVFP4 — will be the payoff for this series.

There's also PR #11997 from @scottgl9 that ungates FusedMoE at a different layer. Between our combined contributions, the Spark goes from "llama.cpp only" to "full TensorRT-LLM with native FP4, optimized attention, and unified-memory KV cache."

The Convergence: NVIDIA Is Building the Same Week

While I was fixing SM121 blockers from the bottom up, NVIDIA engineers were building from the top down — during GTC week, no less. Three NVIDIA PRs landed the same week as my five:

PR      Author     What
#12302  NVIDIA     Core Qwen 3.5 model support — dense + MoE architectures, weight mappers, config, integration tests
#12265  NVIDIA     Qwen 3.5 NVFP4 MoE performance — lm_head sharding, TP8 MoE for the 400B variant
#11997  @scottgl9  Ungate FusedMoE for SM120/SM121 — community PR enabling MoE on consumer Blackwell

This is not a coincidence. The DGX Spark launched with a promise of local AI, but the software stack wasn't ready. Now the community and NVIDIA are converging on the same target from different directions: NVIDIA adding model architectures, community contributors fixing the hardware-specific gaps.

The implication for my setup is significant. Right now I run:

  • MiniMax M2.5 (229B MoE) via llama.cpp on port 8001 — the main brain
  • Qwen3.5-2B via llama.cpp on port 8002 — a 2GB vision sidecar for image understanding

Both are llama.cpp. Once NVIDIA's Qwen 3.5 support (#12302) merges alongside my SM121 fixes, both models could run on TRT-LLM with native FP4 tensor cores. The vision sidecar becomes a TRT-LLM endpoint. M2.5 gets hardware-accelerated inference. Same $3,000 desktop, same models, faster everything.

And the 400B Qwen3.5 MoE variant (#12265 adds NVFP4 support for it) is the next frontier — a model that would make M2.5 look small, potentially runnable on Spark in NVFP4 if the memory math works out.

The DGX Spark costs $3,000. Five bugs stood between it and running a 229B model at tensor core speed. Now they're fixed. And the models are catching up to the hardware.


Part 3: The Autoresearcher

Before I started fixing TRT-LLM, I needed to understand what "good" looks like on this hardware. What's the optimal llama.cpp configuration for a 229B MoE model on 128GB of unified memory with 273 GB/s bandwidth? There are dozens of knobs: GPU layer count, batch size, context length, KV cache quantization, flash attention, NUMA strategy, thread count, memory mapping, speculative decoding.

I wasn't going to tune this by hand. So I built an autoresearcher — inspired by Karpathy's approach of using AI to explore its own training hyperparameters, but applied to inference optimization.

The Search Space

The optimizer defines 13 axes of exploration:

Parameter       Range                        Why it matters
n_gpu_layers    0-62                         How much of the 62-layer MoE lives on GPU vs CPU
ctx_size        2K-32K                       Context window. More = more KV cache memory
cache_type_k/v  f16, q8_0, q4_0              KV cache quantization — the biggest memory lever
n_batch         128-2048                     Prefill batch size
flash_attn      on/off                       Flash attention for memory-efficient attention
n_threads       1-20                         CPU threads (Spark has 20 ARM cores)
numa            disabled/distribute/isolate  NUMA scheduling strategy
mmap / mlock    bool                         Memory mapping vs locking
n_draft         0-16                         Speculative decoding (M2.5 has built-in MTP)

How It Works

The autoresearcher is a Python framework (autoresearch/optimize.py) that:

  1. Starts a llama-server with a candidate configuration
  2. Waits for health (model load can take 10+ minutes for 95GB mmap)
  3. Runs quality checks — known-answer prompts with keyword scoring (fission/fusion, fibonacci/memo, assassination/alliance)
  4. Runs speed checks — measures decode tok/s, prefill tok/s, TTFT
  5. Tests context windows — fills context to target size, checks if the model can still recall
  6. Records everything to JSONL — config, timings, quality scores, memory usage
  7. Kills the server and starts the next experiment

There's a two-phase search strategy. Phase 1 (experiments 0-20) explores each axis independently from the default config — three steps per axis, covering the range. Phase 2 uses the best config from Phase 1 as a base and does targeted perturbations.
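The Phase 1 sweep is easy to sketch. Below is a minimal config generator over a few of the axes from the table, with hypothetical default values (the real optimize.py may structure this differently):

```python
# Sketch of the Phase 1 one-axis-at-a-time sweep. Defaults are illustrative.
DEFAULT = {"n_gpu_layers": 62, "ctx_size": 8192, "n_batch": 512, "n_threads": 10}
AXES = {
    "n_gpu_layers": [0, 31, 62],
    "ctx_size":     [2048, 16384, 32768],
    "n_batch":      [128, 1024, 2048],
    "n_threads":    [1, 10, 20],
}

def phase1_configs(default, axes):
    configs = [dict(default)]             # experiment 0: the default itself
    for axis, steps in axes.items():
        for value in steps:
            if value == default[axis]:
                continue                  # already covered by the default run
            cfg = dict(default)
            cfg[axis] = value             # vary exactly one axis
            configs.append(cfg)
    return configs

configs = phase1_configs(DEFAULT, AXES)
print(len(configs))   # 11: the default plus up to 3 perturbations per axis
```

Phase 2 would then take the best-scoring config as the new base and perturb around it.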

For teams with an LLM API key, there's also an agent.py that uses Claude to analyze experiment history and propose the next configuration — a literal AI optimizing its own inference stack. Without a key, it falls back to systematic grid search.

What the Autoresearcher Found

The key discovery: KV cache quantization to q4_0 is completely lossless on MiniMax M2.5. Zero quality degradation on any benchmark prompt. This was not obvious — q4_0 is aggressive quantization (4-bit), and on many models it causes measurable accuracy loss. But M2.5's GQA architecture (48 attention heads, 8 KV heads) is robust to KV cache quantization because the KV heads are already heavily shared.

This single finding unlocked 64K context in 128GB. Without q4_0 KV cache, the maximum context at f16 precision would be ~16K before OOM. With it, 64K fits comfortably alongside the 95GB model weights.

Other findings:

  • --flash-attn on is required (not just the bare flag — this build needs on/off/auto)
  • Mixed KV cache (q8 keys + q4 values) is broken on M2.5 — 3x slower decode, fails context tests
  • Performance is entirely memory-bandwidth-bound at 273 GB/s — no build or config change improves throughput beyond ~24 tok/s
  • All 62 layers on GPU (n_gpu_layers 999) is optimal — the unified memory means there's no real CPU/GPU split

The autoresearcher ran 15 experiments, each taking 15-20 minutes (dominated by model load time). Total optimization time: ~5 hours. The result was a production-ready configuration that has been serving my nanobot agent 24/7 since.


Part 4: The Benchmark (Coming Soon)

Once the five TRT-LLM PRs land, the payoff: a head-to-head benchmark on the same hardware, same model, same prompts.

The Comparison Matrix

Config    Backend                   Model Format  Quantization                      Context
Baseline  llama.cpp b8303           GGUF          Q3_K_XL 3-bit weights, q4_0 KV    4K / 64K
Test A    TRT-LLM (main + 5 fixes)  NVFP4         4-bit tensor core FP4, BF16 attn  4K
Test B    TRT-LLM (main + 5 fixes)  NVFP4         4-bit tensor core FP4, BF16 attn  max achievable

What We're Measuring

Both servers expose OpenAI-compatible /v1/chat/completions endpoints, so the benchmark is backend-agnostic:

  • Decode throughput (tok/s) — the number everyone cares about
  • Prefill throughput (tok/s) — how fast it processes the prompt
  • Time to first token (ms) — latency
  • Peak memory (GB) — via /proc/meminfo (unified memory, so one number)
  • Quality score — known-answer prompts with keyword matching (same prompts the autoresearcher uses)
  • Output similarity — identical prompts to both backends, compare outputs

The expectation: TRT-LLM with native FP4 should be faster (hardware tensor core path vs software dequantization) and higher quality (4-bit vs 3-bit) but tighter on memory (114 GB model vs 95 GB). The KV cache unified memory optimization (PR #12301) is what makes the memory math work — without it, TRT-LLM would waste headroom on phantom copies between GPU and CPU pools that are backed by the same physical memory.

The Script

The benchmark framework builds on the autoresearcher's infrastructure:

# Start both backends
systemctl --user start llama-minimax          # port 8001, Q3_K_XL
scripts/launch_trtllm.sh                      # port 8090, NVFP4

# Run comparison
python bench/compare_backends.py \
  --baseline http://localhost:8001 \
  --test http://localhost:8090 \
  --output results/llama-vs-trtllm.json

Same prompts, same measurement code, same hardware. The only variable is the inference engine and quantization format.

I'll publish the results as Part 4 when the PRs merge and the numbers are real. No predictions, no estimates — just measurements.



Why Not TensorRT-LLM?

Here is the uncomfortable math that sent me back to llama.cpp.

The 427 TFLOPS of FP4 compute sitting in the GB10? Not entirely useless — llama.cpp already uses INT8 tensor cores for prompt processing and FP16 tensor cores for flash attention during prefill. But for token generation — the speed the user actually feels — it does not matter. Decode is purely memory-bandwidth-bound: the GPU spends its time reading 4.58 GiB of model weights per token at 273 GB/s, not computing. Even if the tensor cores had infinite throughput, decode would still cap at ~55 tok/s. The compute is fast. The memory bus is the wall.
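That ceiling is a single division, with the only subtlety being decimal GB versus binary GiB. The assumption baked in is that decode must stream every active-weight byte exactly once per token:

```python
bytes_per_token = 4.58 * 2**30   # 4.58 GiB of weights read per decoded token
bandwidth = 273e9                # 273 GB/s LPDDR5x (decimal, as marketed)

ceiling = bandwidth / bytes_per_token
print(f"{ceiling:.1f} tok/s")    # 55.5 tok/s, the cap quoted above
```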

This means TensorRT-LLM — the entire point of Parts 1 and 2 — is a dead end for MiniMax M2.5 on the Spark. The five PRs I filed unblock TRT-LLM for smaller models (a 70B at NVFP4 is ~35 GB, fits easily), but for my 229B production model, llama.cpp with software dequantization is the only viable path.

So the question becomes: how do I squeeze more out of llama.cpp? The model weights are already at 3 bits — I cannot compress them further without catastrophic quality loss. The only memory left to reclaim is the KV cache.


Part 5: 100K Context — How TurboQuant Gave Me 53% More Memory

MiniMax M2.5 at q4_0 KV cache maxes out at 65K tokens on the Spark. For a personal AI agent that handles long conversations, RAG pipelines, and document analysis, 65K is not enough. The model itself supports 1M tokens — memory is the bottleneck, not architecture. I needed more context without buying more hardware.

The math is simple. At q4_0 (4.5 bits per weight for K+V), a 65K context window burns ~4,464 MiB of KV cache. The 128GB unified pool holds the 95GB model weights plus this cache with thin margins. To get to 100K tokens, I needed to compress the KV cache further — fewer bits per element — without destroying quality.

The Paper

Google's TurboQuant paper (arxiv.org/abs/2504.19874, ICLR 2026) promised 3-6x KV compression with near-lossless quality. The core idea is elegant: attention head vectors are not Gaussian-distributed, so standard uniform quantization wastes bits. TurboQuant fixes this by applying a Walsh-Hadamard Transform (WHT) to rotate vectors into a near-Gaussian distribution, then quantizes with Lloyd-Max codebooks — optimal quantizers for Gaussian data. The rotation is orthogonal, so it preserves dot products. You rotate K and Q the same way, and the attention scores come out identical.
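That dot-product argument is easy to check numerically. Here is a plain-Python sketch of a normalized fast Walsh-Hadamard transform (the full TurboQuant rotation also mixes in random signs, which I omit): because the transform is orthonormal, rotating Q and K identically leaves their dot product, and hence the attention score, untouched.

```python
import math, random

def wht(v):
    # Fast Walsh-Hadamard transform, scaled by 1/sqrt(n) so it is
    # orthonormal (preserves norms and dot products). len(v) must be
    # a power of two.
    v = list(v)
    n, h = len(v), 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                v[j], v[j + h] = v[j] + v[j + h], v[j] - v[j + h]
        h *= 2
    scale = 1 / math.sqrt(n)
    return [x * scale for x in v]

random.seed(0)
q = [random.gauss(0, 1) for _ in range(128)]
k = [random.gauss(0, 1) for _ in range(128)]

dot = lambda a, b: sum(x * y for x, y in zip(a, b))
print(abs(dot(q, k) - dot(wht(q), wht(k))) < 1e-6)   # True: scores unchanged
```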

The llama.cpp community was already working on implementations. Multiple forks, multiple approaches. I picked one and started building.

First Attempt: Everything Wrong

My first attempt was a masterclass in compounding errors.

I started with Madreag's turbo3-cuda fork — a CUDA implementation of TurboQuant stage-1 (rotation + 3-bit codebook, no QJL residual). The fork had a shadow cache architecture: dequantize turbo blocks to f16, then feed the f16 data into standard flash attention kernels. Reasonable design. Completely broken in practice.

The shadow cache path produced perplexity 22.4 on what should have been 8.0. Nearly three times the expected value. And I didn't catch it immediately because I was testing on the wrong model with the wrong corpus.

The model: Qwen3.5-2B. It has only 6 attention layers out of 24 total (the rest are GDN layers with no KV cache). Six layers of signal, eighteen layers of noise-free passthrough. Any KV cache bug gets diluted into near-invisibility. The corpus: a copy of the GPL license text stored at /tmp/ppl_corpus.txt instead of WikiText-2. Legal text has highly repetitive structure and unusual token distributions — perplexity errors on legal text run 3-5x larger than on standard benchmarks, amplifying noise instead of signal.

I also tried the fork's turbo4 format (3-bit base + 1-bit QJL residual). Perplexity: 13,844,131. Not a typo — thirteen million. A single missing exception case in the Q_q8_1 dequant dispatch was silently feeding zeros into the attention kernel. Every decision in this first attempt was wrong.

Seven Bugs

Over the next week, I fixed seven bugs in the Madreag fork:

  1. Shadow dequant path — The f16 shadow cache produced wrong values. The dequant kernels for turbo3 and turbo4 wrote correct-looking f16 data, but when fed into the MMA flash attention kernel, the results were garbage. PPL 22.4 vs 8.0 expected. I never found the root cause — instead I bypassed the shadow path entirely, routing all turbo types through native vec dot kernels that read quantized blocks inline.

  2. turbo4 Q_q8_1 exception — The flash attention vec kernel had a switch statement for query quantization types. turbo4 was missing from it. The compiler didn't warn; the kernel just returned zero for every dot product. One line fix, thirteen million PPL points.

  3. turbo4 dequant dispatch — convert.cu didn't have a case for turbo4 in the block-to-float dequantize path. Turbo4 V paired with q8_0 K worked (V goes through a different path), but turbo4 K was silently returning uninitialized memory.

  4. __launch_bounds__ mismatch — The SET_ROWS kernels for turbo3 and turbo4 declared __launch_bounds__(256, 1) but launched with 32 threads. No correctness impact, but the compiler allocated registers for 256 threads per block, wasting occupancy.

  5. ggml context size — An upstream merge dropped the +2 that reserved ggml tensor overhead for the turbo rotation tensors. The context allocator would silently under-allocate, causing sporadic crashes during long inference runs.

  6. Centroid values — The original fork used Lloyd-Max centroids computed for N(0,1). But TurboQuant applies the WHT before quantization, and the post-rotation distribution is not exactly N(0,1) — it depends on the input distribution. The centroids were close (max 1.65% error, 0.06% MSE impact) but not optimal. I verified this was negligible but it sent me down a two-day rabbit hole before I confirmed it.

  7. Struct field order — The TBQ3_0 block struct had qs[] and d in the wrong order relative to TBQ4_0 and upstream. This didn't affect correctness (fields were accessed by name, not offset), but it broke binary compatibility with any future upstream merge. Fixed to match the upstream PR's layout.

Along the way, I discovered that the fork's "turbo3" was actually stage-1 only — rotation plus codebook, without the QJL residual correction that the paper describes for production TurboQuant. This explained why turbo3/turbo3 symmetric gave PPL 8.0 (+5.12%) — 3-bit codebook without residual correction is simply not enough bits for K vectors.

The Key Insight: K Precision Dominates

The pattern became clear after testing every combination:

K (key) precision determines quality. V (value) can be compressed aggressively.

The data is unambiguous. With q8_0 keys, you can compress values to 3-bit turbo and get +0.36% PPL — essentially lossless. With tbq4_0 keys (4-bit), values can go to 3-bit and stay at +1.93%. But 3-bit keys always fail, regardless of V precision. turbo3/turbo3 is +5.12%. tbq3_0/tbq3_0 is +6.96%. The key vectors carry the geometric structure of the attention manifold — compress them too hard and the model loses the ability to distinguish which tokens matter.

This asymmetry is the entire strategy. Use 4-bit keys to preserve attention routing, and 3-bit values to save memory where quality is robust. Average it out: 3.6 bits per weight for the KV cache, down from 4.5 with q4_0.

Building TBQ4_0 and TBQ3_0

With the bug-fixing phase behind me, I built proper TBQ (TurboBlockQuant) types from scratch. TBQ4_0 is 4-bit with a 128-element block (4.125 bpw including the norm scalar). TBQ3_0 is 3-bit at 3.125 bpw. Both use the same Walsh-Hadamard rotation with a fixed orthogonal matrix generated from a deterministic seed, and optimal Lloyd-Max codebooks for the post-rotation Gaussian distribution.

The critical engineering decision was how to handle the rotation inside the flash attention kernel. The naive approach — rotate every K vector during the attention computation — would add O(n * d^2) work per query, where n is sequence length and d is head dimension (128). Unacceptable.

The insight: pre-rotate Q once per kernel invocation via shared memory. Load Q into shared memory, apply the 128x128 rotation matrix, write back. Cost: one matrix-vector multiply per attention head per token. Then for every K vector in the sequence, the dot product is just codebook lookups — 128 table reads and accumulates. The rotation cost is O(d^2) amortized over the entire sequence, making the per-token overhead O(1). For V accumulation, you accumulate in the rotated domain and apply the inverse rotation to the final output — again O(d^2) once, not per-token.

// Pseudocode for the FA kernel hot path
// 1. Pre-rotate Q into shared memory (once per head)
__shared__ float Q_rotated[D];  // D = 128
for (int i = tid; i < D; i += blockDim.x)
    Q_rotated[i] = dot(rotation_row[i], Q_original);
__syncthreads();

// 2. For each K in the sequence: just codebook lookups
for (int pos = 0; pos < seq_len; pos++) {
    float score = 0;
    for (int j = 0; j < D; j++) {
        uint8_t idx = K_block[pos].qs[j];  // 3 or 4 bit index
        score += centroid[idx] * Q_rotated[j];
    }
    score *= K_block[pos].norm;
    // ... softmax, V accumulation ...
}

The implementation spans 7,400 lines of new code across 29 files. New CUDA kernels for SET_ROWS (quantization during KV cache writes), dequantization, and the flash attention vec dot products. New CPU fallback paths. New type registration in ggml. Template instantiations for every combination of K and V types. The rotation matrix itself — a 128x128 orthogonal matrix — lives in device constant memory, loaded once at initialization.

I built and validated everything on SM121 (DGX Spark's GPU architecture) with CUDA 13.2. The branch is at mihai-chiorean/turbo3-cuda feat/tbq4-cuda-fa-sm121.

Validation: Llama-3.1-8B on WikiText-2

All perplexity numbers measured on Llama-3.1-8B-Instruct Q4_K_M with WikiText-2, context 2048, 4 chunks, flash attention on, all layers on GPU:

Config (K/V)   PPL     vs f16  Avg bpw  Status
f16/f16        7.6186  --      16.0     baseline
tbq4_0/tbq4_0  7.6925  +0.97%  4.125    PASS
tbq4_0/tbq3_0  7.7657  +1.93%  3.6      PASS
q4_0/q4_0      7.7748  +2.05%  4.5      borderline
tbq3_0/tbq3_0  8.1489  +6.96%  3.125    FAIL

The winning config — tbq4_0 keys, tbq3_0 values — lands at +1.93% vs f16. Better quality than the q4_0/q4_0 I was running in production (+2.05%), at 20% fewer bits per KV element. That is the entire trick: better quality AND less memory, because the rotation makes the codebook optimal for the data distribution instead of just slicing the number line into equal intervals.

On MiniMax M2.5 itself, the results were even more favorable. PPL error bars overlapped across all configs — f16, q4_0, tbq4_0, tbq3_0 — meaning TBQ caused no statistically significant quality degradation on the production model.

Learning from TheTom

I was not the only person building TurboQuant for llama.cpp. TheTom's turboquant_plus fork had 3,845 stars, test coverage across 50+ model/hardware combinations, and several innovations I did not have.

His turbo4 format was impressive — q8_0/turbo4 measured at PPL 7.6038, which is -0.19% vs f16. Negative. Better than uncompressed. This is within noise, but it demonstrates that a well-implemented 4-bit codebook with norm correction can be essentially transparent.

I adopted two techniques from his work:

  1. Norm correction — After quantizing a vector, compute the ratio of the original vector's norm to the reconstructed vector's norm. Store it as a scale factor. During dequantization, multiply by this ratio. This corrects for the systematic norm shrinkage that codebook quantization introduces. Simple, effective, and it moved our tbq4_0/tbq4_0 from +1.07% to +0.97%.

  2. Sparse V dequant — For decode (single-token generation), most V vectors get multiplied by near-zero attention weights. Skip the dequant for entries below a threshold. TheTom reported +22.8% decode speed at 32K context. We adopted a tighter threshold (1e-6) for numerical safety.
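Norm correction is only a few lines. The sketch below uses a toy 2-bit codebook (hypothetical centroids, not the real Lloyd-Max values) purely to show the mechanism: the stored scale makes the reconstructed vector's norm match the original exactly.

```python
import math, random

def quantize_with_norm_correction(v, centroids):
    # Nearest-centroid quantization plus the norm-ratio scale described
    # above (sketch only; real TBQ blocks also carry rotation state).
    idxs = [min(range(len(centroids)), key=lambda c: (x - centroids[c]) ** 2)
            for x in v]
    recon = [centroids[i] for i in idxs]
    rnorm = math.sqrt(sum(x * x for x in recon))
    vnorm = math.sqrt(sum(x * x for x in v))
    scale = vnorm / rnorm if rnorm > 0 else 1.0
    return idxs, scale

def dequantize(idxs, scale, centroids):
    return [centroids[i] * scale for i in idxs]

random.seed(1)
v = [random.gauss(0, 1) for _ in range(128)]
centroids = [-1.5, -0.5, 0.5, 1.5]        # toy 2-bit codebook, not real Lloyd-Max
idxs, scale = quantize_with_norm_correction(v, centroids)
vq = dequantize(idxs, scale, centroids)

norm = lambda u: math.sqrt(sum(x * x for x in u))
print(abs(norm(vq) - norm(v)) < 1e-9)     # True: the norm is exactly restored
```

Direction error per element remains, of course; the correction only removes the systematic norm shrinkage.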

But for my specific goal — maximum context extension — our TBQ format at the tbq4_0/tbq3_0 operating point was actually better than TheTom's nearest equivalent. His turbo4/turbo3 landed at +2.35% PPL (above my 2% quality bar), while our tbq4_0/tbq3_0 held at +1.93%. At the aggressive 3-bit V compression point that unlocks the context gain, our format had the edge.

I documented this analysis on the upstream TBQ pull request, along with the full benchmark data and our CUDA implementation notes.

The Result: 100K Tokens

MiniMax M2.5, DGX Spark, 128GB unified memory. Before and after:

Config                 Context  KV Cache   Decode tok/s  PPL Impact
q4_0/q4_0 (before)     65K      4,464 MiB  24.92         +2.05% vs f16
tbq4_0/tbq3_0 (after)  100K     5,492 MiB  23.83         +1.93% vs f16

100K tokens. 53% more context than the 65K baseline. The KV cache uses 5,492 MiB at 100K — more total memory than before, but the per-token cost is lower (3.6 bpw vs 4.5 bpw), so you get more tokens per megabyte. Decode speed drops from 24.92 to 23.83 tok/s — a 4% hit from the codebook lookups and rotation overhead. The server runs stably.
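
A quick sanity check on the tokens-per-megabyte claim, using the table's numbers:

```python
# Per-token KV cache cost, computed from the before/after table above.
before_mib, before_ctx = 4464, 65_000    # q4_0/q4_0
after_mib, after_ctx = 5492, 100_000     # tbq4_0/tbq3_0

per_tok_before = before_mib / before_ctx
per_tok_after = after_mib / after_ctx

print(f"before:  {per_tok_before:.4f} MiB/token")
print(f"after:   {per_tok_after:.4f} MiB/token")
print(f"savings: {1 - per_tok_after / per_tok_before:.0%}")
```

The per-token saving comes out at 20% — matching the 3.6/4.5 bits-per-element ratio, as it should.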

To put that in practical terms: 65K tokens is roughly a 50-page document. 100K tokens is roughly 75 pages. For my nanobot agent doing RAG over local documents, reading long Discord threads, or maintaining multi-day conversation history, the difference between "fits" and "doesn't fit" is real. And the quality is actually slightly better than before — +1.93% vs +2.05% PPL degradation relative to f16.

The Upstream Path

This work lives on our fork for now. The upstream path to llama.cpp mainline requires coordination:

  1. PR #21089 needs to land first — it defines the CPU-only TBQ3_0/TBQ4_0 types with QK_K=256 block format, establishing the ABI that all CUDA implementations must match.
  2. Our CUDA flash attention kernels would follow as a separate PR, adapted from QK=128 to QK_K=256 blocks.
  3. SM121-specific fixes and the rotation matrix loading would be follow-up patches.

Our fork uses QK=128 blocks (one rotation group = one attention head dimension), which is binary-incompatible with upstream's QK_K=256 (two rotation groups per block). This was a deliberate choice for SM121 — 128-element blocks align perfectly with the 128-dimensional attention heads in both Llama and MiniMax, avoiding cross-group bookkeeping in the FA kernel. The upstream port will need to handle the 256-element block split.
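
For the curious, the "Avg bpw" figures in the earlier table fall out of this layout directly — assuming one fp16 scale per 128-element block, which is my reading of the fork's QK=128 format (the upstream QK_K=256 layout will differ):

```python
# Bits per KV element for a QK=128 block: the per-element code bits plus
# a 16-bit fp16 scale amortized over the 128 elements in the block.
def bpw(code_bits, block=128, scale_bits=16):
    return code_bits + scale_bits / block

print(bpw(4))                  # tbq4_0
print(bpw(3))                  # tbq3_0
print((bpw(4) + bpw(3)) / 2)   # average for the tbq4_0/tbq3_0 K/V pair
```

That gives 4.125 and 3.125 bpw, averaging 3.625 — the "3.6" quoted for the winning tbq4_0/tbq3_0 config.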

Needle-in-a-Haystack: Does the Model Still Remember?

Context extension is pointless if the model can't actually use the extra tokens. I ran a Needle-in-a-Haystack test: hide a specific fact in thousands of lines of filler text, then ask the model to retrieve it.

At 5K and 10K tokens, the model found the needle immediately. At 10K with the needle placed in the middle, it found the needle but refused to extract it — MiniMax M2.5's safety layer classified my test as a prompt injection attempt. That's the model's guardrails working correctly, not a TBQ quality issue.

At 30K+ tokens, the tests timed out. Not because the model couldn't find the needle, but because MiniMax M2.5's thinking mode generates unbounded reasoning chains before answering. At long context, the chain-of-thought can take 15+ minutes. A production fix would be configuring thinking budgets, but that's a model configuration question, not a KV cache compression question.

The key finding: TBQ KV cache compression preserves long-range information retrieval. The model sees and processes the same information — it's just stored in fewer bits.

Production: The Switch

The production server is now running with TBQ at 96K context:

systemctl --user status llama-minimax

● llama-minimax.service - MiniMax M2.5 llama-server (TBQ KV, 96K context)
  Active: active (running)
  Config: --cache-type-k tbq4_0 --cache-type-v tbq3_0 --ctx-size 96000

Same port, same API, same nanobot connection. Just 48% more context window. The model answers the same questions the same way — it just remembers more of the conversation.

The Bandwidth Wall

Here is what this post would not tell you if I stopped here: 96K context does not mean 96K tokens at full speed.

I ran a four-turn conversation, each turn requesting a ~1000-word essay. The decode speed dropped with every turn:

Turn  Accumulated Context  Decode Speed
1     1.4K tokens          21.8 tok/s
2     2.6K tokens          18.0 tok/s
3     3.8K tokens          16.3 tok/s
4     4.6K tokens          15.1 tok/s

In this range, every additional thousand tokens of context costs roughly 1.5-2 tok/s of decode speed. The underlying cost — extra read time per generated token — grows linearly with context and is fundamental: the attention mechanism reads the entire KV cache on every generated token, and on 273 GB/s bandwidth there is no way around it.
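
The linearity claim is easy to check against the four measured points: fit seconds-per-token (not tok/s) against context length. A rough sketch, least squares by hand:

```python
# Fit per-token decode latency as a linear function of context length,
# using the four measured turns above (ordinary least squares by hand).
ctx = [1400, 2600, 3800, 4600]   # accumulated context (tokens)
tps = [21.8, 18.0, 16.3, 15.1]   # measured decode speed (tok/s)

sec = [1.0 / s for s in tps]     # seconds per generated token
n = len(ctx)
mx, my = sum(ctx) / n, sum(sec) / n
slope = sum((x - mx) * (y - my) for x, y in zip(ctx, sec)) / \
        sum((x - mx) ** 2 for x in ctx)
base = my - slope * mx

print(f"base latency: {base * 1e3:.1f} ms/token")            # the weight read
print(f"KV read cost: {slope * 1e6:.2f} us per context token")  # the cache read
```

On these four points the fit gives a base latency around 38 ms/token (the model weight read) plus a few microseconds per context token (the KV cache read) — exactly the shape a bandwidth-bound attention read predicts.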

Extrapolating: at 20K context, decode drops to ~12 tok/s. At 50K, maybe 5-8 tok/s. At 90K, you are waiting minutes per response. This is not a TurboQuant issue — the same curve applies to q4_0 at 65K. It is the DGX Spark being a bandwidth-limited machine.

What the 96K window actually gives you:

  • Normal conversations (1-10K): Fast. 15-22 tok/s. The user experience is good.
  • Long conversations (10-30K): Usable. 8-15 tok/s. Noticeably slower but functional.
  • Very long context (30-60K): Slow. 3-8 tok/s. The model still works — it just thinks visibly.
  • Maximum context (60-96K): Available but painful. Useful for one-shot document analysis where you are willing to wait, not for interactive chat.

The old q4_0 setup hard-crashed beyond 65K. The new TBQ setup lets me use 96K without crashing, and the 20-30K comfortable range is genuinely more than I had before. Most of my nanobot conversations stay under 10K tokens, where the speed is indistinguishable from the old configuration.

The honest conclusion: I got 48% more context ceiling and roughly 30% more usable context, at the cost of 4% peak decode speed. For a $3,000 desktop running a 229B model, I will take that trade.

One More Optimization: Walsh-Hadamard

A reader pointed me to the ITQ3_S paper, which fuses a Walsh-Hadamard Transform into CUDA kernel shared memory loading. My original rotation used a dense 128x128 matrix — 64KB of shared memory, forcing the flash attention kernel to process one query column at a time. The WHT does the same rotation with a 7-stage butterfly pattern in 512 bytes of shared memory. Swapping it in deleted 4,000 lines of rotation matrix constants, freed shared memory for double-width prefill processing, and improved prefill throughput by 15%. Quality was unchanged — both transforms are orthogonal, so the quantization error is identical.
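
The butterfly is short enough to show in full. This Python sketch mirrors the structure the CUDA kernel runs in shared memory (a hand sketch, not the kernel itself): log2(128) = 7 stages of paired add/subtract, one normalization pass, and no matrix anywhere:

```python
import math
import random

def wht_inplace(v):
    """Fast Walsh-Hadamard transform: log2(d) butterfly stages, O(d log d) adds.
    Normalized by 1/sqrt(d), so it is orthonormal and its own inverse."""
    d = len(v)
    stages = 0
    h = 1
    while h < d:
        for i in range(0, d, 2 * h):
            for j in range(i, i + h):
                x, y = v[j], v[j + h]
                v[j], v[j + h] = x + y, x - y  # the butterfly pair
        h *= 2
        stages += 1
    s = 1.0 / math.sqrt(d)
    for i in range(d):
        v[i] *= s
    return stages

random.seed(1)
v = [random.gauss(0, 1) for _ in range(128)]
orig = list(v)

stages = wht_inplace(v)                      # 7 stages for d=128
norm_before = math.sqrt(sum(x * x for x in orig))
norm_after = math.sqrt(sum(x * x for x in v))  # orthonormal: norm preserved

wht_inplace(v)                               # applying it twice recovers the input
max_err = max(abs(a - b) for a, b in zip(v, orig))
```

The entire working set is the 128-element vector itself — 512 bytes at fp32 — which is where the shared-memory saving over a dense 64KB matrix comes from.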

Sometimes the best optimization is replacing an O(n^2) operation with an O(n log n) one that was invented in 1867.

What Did Not Work: Speculative Decoding

The idea behind speculative decoding is simple: guess several tokens cheaply, then verify them all in one forward pass. If most guesses are right, you get multiple tokens for the cost of one model read. On compute-bound hardware, this is a legitimate 1.5-2x speedup.

On the DGX Spark, it does nothing. I added llama.cpp's n-gram speculation flags (--spec-type ngram-mod --draft-max 48) and benchmarked across code generation, technical explanation, and chat tasks. Results: 22.6-23.9 tok/s — within noise of the ~24 tok/s baseline.

The reason is the same bandwidth wall. Speculative decoding trades compute for fewer sequential steps. But on 273 GB/s, each step is not compute-bound — it is memory-read-bound. The verification pass reads the full model weights regardless of how many draft tokens it checks. You cannot speculate your way out of a bandwidth bottleneck.

I left the flags out of production. On faster-bandwidth hardware (Mac Studio at 800 GB/s, or a datacenter B200 at 8 TB/s HBM3e), speculation would matter. On the Spark, it is overhead.

What I Learned

TurboQuant works. Not the first implementation I tried, and not without a week of debugging, but the math is sound. Rotate into a Gaussian basis, quantize with optimal codebooks, and the information loss is minimal. The Walsh-Hadamard Transform is the key — it is fast (O(d log d) for a d-dimensional vector), orthogonal (preserves norms and dot products), and deterministic (no training data needed, just a fixed seed).

The engineering lessons:

  • Test on the right model with the right corpus. Qwen3.5-2B with 6 attention layers and GPL license text cost me three days of chasing phantom bugs. Llama-3.1-8B with WikiText-2 is the standard for a reason.
  • K precision is non-negotiable. Every attempt at 3-bit keys failed across every implementation I tested. 4-bit is the floor for keys. Values are forgiving.
  • Bypass, don't debug, when time is short. The shadow cache dequant bug was never root-caused. I routed around it with native vec kernels and moved on. Sometimes the fastest path forward is not through the broken code.
  • Read other people's code. TheTom's norm correction and sparse V dequant were better than what I would have built from scratch. The open-source TurboQuant ecosystem — Discussion #20969, TheTom's fork, the upstream PR, spiritbuun's CUDA fork — is a genuine collaboration where everyone's numbers are public and reproducible.

The DGX Spark now runs MiniMax M2.5 at 100K context, 23.83 tok/s, with better quality than the previous 65K configuration. Same $3,000 desktop. Same model. Just smarter compression.


What's Next

Two paths could break through the 273 GB/s bandwidth wall. First: sparse MoE expert loading — MiniMax M2.5 activates only 3% of its 256 experts per token, but llama.cpp currently reads far more than needed. Profiling the actual page fault pattern on unified memory and adding expert prefetching could dramatically reduce the bytes-per-token during decode. Second: Blackwell's tensor cores already accelerate prefill via INT8 MMA, but the KV cache attention path runs on CUDA cores. Block-scaled quantization (tcgen05.mma.blockscaled) could fuse KV dequantization directly into the tensor core matmul — hardware-accelerated TurboQuant, if the math works out.
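
Back-of-envelope numbers for the first path, with assumed figures — roughly 3.5 bits per weight after quantization, and ignoring KV cache and activation traffic entirely:

```python
# Rough roofline for the sparse-expert idea (an estimate, not a measurement).
total_params = 229e9
active_params = 10e9
bandwidth = 273e9     # bytes/s, DGX Spark LPDDR5x
bpw = 3.5             # assumed average bits per weight after quantization

bytes_active = active_params * bpw / 8   # if decode read only active experts
ceiling = bandwidth / bytes_active       # ideal tok/s under that assumption

print(f"ideal ceiling, active-only reads: {ceiling:.0f} tok/s")
print(f"active fraction of parameters:    {active_params / total_params:.1%}")
```

Even this idealized ceiling (~62 tok/s) is finite, but it sits well above the measured 23.83 tok/s — which is precisely what makes profiling the actual page fault pattern worth doing.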

The hardware is not changing. The software is still catching up.

Mihai Chiorean is a software engineer in San Francisco. Previously CTO at Wendy Labs (edge OS on Yocto/Jetson), EM at Cash App (compliance rules engine, $100B+ txn volume), and engineer at Uber, Block/TBD, and InVision. He builds sovereign AI systems on NVIDIA hardware and contributes to TensorRT-LLM and NemoClaw.