Sovereign AI on a Desktop, Part 4: 100K Context¶
Mihai Chiorean | March 2026
Series: Sovereign AI on a Desktop
- Part 1: The Stack -- What I'm running and why
- Part 2: Five Bugs in NVIDIA's Code -- Fixing TensorRT-LLM for DGX Spark
- Part 3: The Autoresearcher -- An AI agent that optimizes its own inference
- Part 4: 100K Context -- KV cache compression via TurboQuant (you are here)
- Part 5: The Bandwidth Wall -- What actually limits a $3,000 desktop
With a q4_0 KV cache, MiniMax M2.5 maxes out at 65K tokens of context on the Spark. For a personal AI agent that handles long conversations, RAG pipelines, and document analysis, 65K is not enough. The model itself supports 1M tokens -- memory is the bottleneck, not architecture. I needed more context without buying more hardware.
The math is simple. At q4_0 (4.5 bits per element for K+V), a 65K context window burns ~4,464 MiB of KV cache. The 128GB unified pool holds the 95GB model weights plus this cache with thin margins. To get to 100K tokens, I needed fewer bits per KV element without destroying quality.
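The arithmetic checks out in a few lines. One caveat: the per-token element count below is back-solved from the numbers in this paragraph (4,464 MiB at 65K context and 4.5 bpw), not taken from the model card -- treat it as an assumption.

```python
def kv_cache_mib(ctx_tokens, bits_per_elem, elems_per_token=126_976):
    """KV cache size in MiB. elems_per_token (K+V combined) is
    back-solved from 4,464 MiB at 65,536 tokens and 4.5 bpw."""
    return ctx_tokens * elems_per_token * bits_per_elem / 8 / 2**20

# q4_0 K+V at 4.5 bpw, 65K context -> the 4,464 MiB quoted above
print(kv_cache_mib(65_536, 4.5))      # 4464.0

# the tbq4_0/tbq3_0 blend described later averages 3.625 bpw,
# which puts 100K tokens in a similar memory envelope
print(kv_cache_mib(100_000, 3.625))   # ~5487 MiB
```

The same per-token element count at fewer bits per element is the whole game: memory scales linearly with both, so every bit shaved off the average translates directly into context.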
The TRT-LLM NVFP4 path was a dead end (Part 2). The autoresearcher had found the optimal llama.cpp configuration (Part 3). The only remaining lever was KV cache compression -- squeezing more context out of the same memory by storing K and V vectors more efficiently.
The Paper¶
Google's TurboQuant paper (arxiv.org/abs/2504.19874) promised 3-6x KV compression with near-lossless quality. The core idea is elegant: attention head vectors are not Gaussian-distributed, so standard uniform quantization wastes bits. TurboQuant fixes this by applying a Walsh-Hadamard Transform (WHT) to rotate vectors into a near-Gaussian distribution, then quantizing with Lloyd-Max codebooks -- optimal quantizers for Gaussian data. The rotation is orthogonal, so it preserves dot products: rotate K and Q the same way, and the attention scores come out identical.
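The dot-product-preservation claim is easy to verify numerically. A minimal sketch using the Sylvester construction of a normalized Walsh-Hadamard matrix (the real kernels use a fast in-place butterfly transform, not an explicit matrix multiply):

```python
import numpy as np

def hadamard(n):
    """Normalized Walsh-Hadamard matrix via the Sylvester construction.
    n must be a power of two. Orthonormal: H @ H.T == I."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

d = 128  # head dimension
H = hadamard(d)
rng = np.random.default_rng(0)
q, k = rng.standard_normal(d), rng.standard_normal(d)

# orthogonal rotation: the attention score q.k is unchanged
assert np.allclose(q @ k, (H @ q) @ (H @ k))
```

Because the rotation cancels in the dot product, you can quantize K in the rotated domain and compare it against a rotated Q without ever undoing the transform inside the attention kernel.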
The llama.cpp community was already working on implementations. Multiple forks, multiple approaches. I picked Madreag's turbo3-cuda fork -- a CUDA implementation of TurboQuant stage-1 (rotation + 3-bit codebook, no QJL residual) -- and started building.
First Attempt: Everything Wrong¶
My first attempt was a masterclass in compounding errors.
The fork had a shadow cache architecture: dequantize turbo blocks to f16, then feed the f16 data into standard flash attention kernels. Reasonable design. Completely broken in practice.
The shadow cache path produced perplexity 22.4 on what should have been 8.0. Nearly three times the expected value. And I did not catch it immediately because I was testing on the wrong model with the wrong corpus.
The model: Qwen3.5-2B. It has only 6 attention layers out of 24 total (the rest are GDN layers with no KV cache). Six layers of signal, eighteen layers of noise-free passthrough. Any KV cache bug gets diluted into near-invisibility.
The corpus: a copy of the GPL license text stored at /tmp/ppl_corpus.txt instead of WikiText-2. Legal text has highly repetitive structure and unusual token distributions -- perplexity errors on legal text run 3-5x larger than on standard benchmarks, amplifying noise instead of signal.
I also tried the fork's turbo4 format (3-bit base + 1-bit QJL residual). Perplexity: 13,844,131. Not a typo -- thirteen million. A single missing exception case in the Q_q8_1 dequant dispatch was silently feeding zeros into the attention kernel. Every decision in this first attempt was wrong.
Seven Bugs¶
Over the next week, I fixed seven bugs in the Madreag fork:
- Shadow dequant path -- The f16 shadow cache produced wrong values. The dequant kernels for turbo3 and turbo4 wrote correct-looking f16 data, but when fed into the MMA flash attention kernel, the results were garbage. PPL 22.4 vs 8.0 expected. I never found the root cause -- instead I bypassed the shadow path entirely, routing all turbo types through native vec dot kernels that read quantized blocks inline.
- turbo4 Q_q8_1 exception -- The flash attention vec kernel had a switch statement for query quantization types. turbo4 was missing from it. The compiler did not warn; the kernel just returned zero for every dot product. One-line fix, thirteen million PPL points.
- turbo4 dequant dispatch -- convert.cu did not have a case for turbo4 in the block-to-float dequantize path. turbo4 V paired with q8_0 K worked (V goes through a different path), but turbo4 K was silently returning uninitialized memory.
- __launch_bounds__ mismatch -- The SET_ROWS kernels for turbo3 and turbo4 declared __launch_bounds__(256, 1) but launched with 32 threads. No correctness impact, but the compiler allocated registers for 256 threads per block, wasting occupancy.
- ggml context size -- An upstream merge dropped the +2 that reserved ggml tensor overhead for the turbo rotation tensors. The context allocator would silently under-allocate, causing sporadic crashes during long inference runs.
- Centroid values -- The original fork used Lloyd-Max centroids computed for N(0,1). But TurboQuant applies the WHT before quantization, and the post-rotation distribution is not exactly N(0,1) -- it depends on the input distribution. The centroids were close (max 1.65% error, 0.06% MSE impact) but not optimal. I verified the impact was negligible, but only after a two-day rabbit hole.
- Struct field order -- The TBQ3_0 block struct had qs[] and d in the wrong order relative to TBQ4_0 and upstream. This did not affect correctness (fields were accessed by name, not offset), but it broke binary compatibility with any future upstream merge. Fixed to match the upstream PR's layout.
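For reference, Lloyd-Max centroids for a 1-D distribution can be computed with plain Lloyd iterations on samples -- effectively k-means on a line. This is a sketch of the general technique, not the fork's actual codebook-training code:

```python
import numpy as np

def lloyd_max_centroids(samples, n_levels, iters=50):
    """1-D Lloyd-Max quantizer: alternate nearest-centroid assignment
    and centroid re-estimation until the codebook settles."""
    # initialize at evenly spaced quantiles of the sample distribution
    c = np.quantile(samples, np.linspace(0, 1, n_levels + 2)[1:-1])
    for _ in range(iters):
        idx = np.abs(samples[:, None] - c[None, :]).argmin(axis=1)
        for j in range(n_levels):
            if np.any(idx == j):
                c[j] = samples[idx == j].mean()
    return np.sort(c)

rng = np.random.default_rng(0)
gauss = rng.standard_normal(200_000)
c8 = lloyd_max_centroids(gauss, 8)  # a 3-bit codebook for ~N(0,1)
print(np.round(c8, 3))              # symmetric about zero
```

Feed it post-WHT samples instead of raw N(0,1) draws and you get the distribution-matched centroids the bug above was about; the point of the two-day rabbit hole was that the difference turns out to be tiny.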
Along the way, I discovered that the fork's "turbo3" was actually stage-1 only -- rotation plus codebook, without the QJL residual correction that the paper describes for production TurboQuant. This explained why turbo3/turbo3 symmetric gave PPL 8.0 (+5.12%) -- 3-bit codebook without residual correction is simply not enough bits for K vectors.
The Key Insight: K Precision Dominates¶
The pattern became clear after testing every combination:
K (key) precision determines quality. V (value) can be compressed aggressively.
The data is unambiguous:
| Config (K/V) | PPL | vs f16 | Status |
|---|---|---|---|
| f16/f16 | 7.6186 | -- | baseline |
| q8_0/tbq3_0 | 7.6844 | +0.86% | PASS |
| q8_0/tbq4_0 | 7.63 | +0.15% | PASS |
| tbq4_0/tbq4_0 | 7.6925 | +0.97% | PASS |
| tbq4_0/tbq3_0 | 7.7657 | +1.93% | PASS |
| q4_0/q4_0 | 7.7748 | +2.05% | borderline |
| turbo3/turbo3 | 8.0090 | +5.12% | FAIL |
| tbq3_0/tbq3_0 | 8.1489 | +6.96% | FAIL |
With q8_0 keys, you can compress values to 3-bit turbo and get +0.86% PPL -- essentially lossless. With tbq4_0 keys (4-bit), values can go to 3-bit and stay at +1.93%. But 3-bit keys always fail, regardless of V precision. The key vectors carry the geometric structure of the attention manifold -- compress them too hard and the model loses the ability to distinguish which tokens matter.
This asymmetry is the entire strategy. Use 4-bit keys to preserve attention routing, and 3-bit values to save memory where quality is robust. Average it out: 3.6 bits per element for the KV cache, down from 4.5 with q4_0.
Building TBQ4_0 and TBQ3_0¶
With the bug-fixing phase behind me, I built proper TBQ (TurboBlockQuant) types from scratch. TBQ4_0 is 4-bit with a 128-element block (4.125 bpw including the norm scalar). TBQ3_0 is 3-bit at 3.125 bpw. Both use the same Walsh-Hadamard rotation with a fixed orthogonal matrix generated from a deterministic seed, and optimal Lloyd-Max codebooks for the post-rotation Gaussian distribution.
The critical engineering decision was how to handle the rotation inside the flash attention kernel. The naive approach -- rotate every K vector during the attention computation -- would add O(n * d^2) work per query, where n is sequence length and d is head dimension (128). Unacceptable.
The insight: pre-rotate Q once per kernel invocation via shared memory. Load Q into shared memory, apply the 128x128 rotation matrix, write back. Cost: one matrix-vector multiply per attention head per token. Then for every K vector in the sequence, the dot product is just codebook lookups -- 128 table reads and accumulates. The rotation cost is O(d^2) amortized over the entire sequence, making the per-token overhead O(1).
For V accumulation, you accumulate in the rotated domain and apply the inverse rotation to the final output -- again O(d^2) once, not per-token.
// Pseudocode for the FA kernel hot path
// 1. Pre-rotate Q into shared memory (once per head)
__shared__ float Q_rotated[D];  // D = 128
for (int i = tid; i < D; i += blockDim.x)
    Q_rotated[i] = dot(rotation_row[i], Q_original);
__syncthreads();

// 2. For each K in the sequence: just codebook lookups
for (int pos = 0; pos < seq_len; pos++) {
    float score = 0.0f;
    for (int j = 0; j < D; j++) {
        uint8_t idx = K_block[pos].qs[j];       // 3- or 4-bit codebook index
        score += centroid[idx] * Q_rotated[j];
    }
    score *= K_block[pos].norm;                 // per-block norm scale
    // ... softmax, V accumulation ...
}
The implementation spans 7,400 lines of new code across 29 files. New CUDA kernels for SET_ROWS (quantization during KV cache writes), dequantization, and the flash attention vec dot products. New CPU fallback paths. New type registration in ggml. Template instantiations for every combination of K and V types. The rotation matrix itself -- a 128x128 orthogonal matrix -- lives in device constant memory, loaded once at initialization.
I built and validated everything on SM121 (DGX Spark's GPU architecture) with CUDA 13.2. The branch is at mihai-chiorean/turbo3-cuda feat/tbq4-cuda-fa-sm121.
Learning from TheTom¶
I was not the only person building TurboQuant for llama.cpp. TheTom's turboquant_plus fork was the most mature implementation -- tested across 50+ model/hardware combinations with several innovations I adopted.
His turbo4 format was impressive -- q8_0/turbo4 measured at PPL 7.6038, which is -0.19% vs f16. Negative. Better than uncompressed. This is within noise, but it demonstrates that a well-implemented 4-bit codebook with norm correction can be essentially transparent.
I adopted two techniques from his work:
- Norm correction -- After quantizing a vector, compute the ratio of the original vector's norm to the reconstructed vector's norm and store it as a scale factor. During dequantization, multiply by this ratio. This corrects for the systematic norm shrinkage that codebook quantization introduces. Simple, effective, and it moved our tbq4_0/tbq4_0 from +1.07% to +0.97%.
- Sparse V dequant -- For decode (single-token generation), most V vectors get multiplied by near-zero attention weights. Skip the dequant for entries below a threshold. TheTom reported +22.8% decode speed at 32K context. We adopted a tighter threshold (1e-6) for numerical safety.
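The norm-correction idea is small enough to show end to end. A sketch with a uniform grid standing in for the Lloyd-Max codebook (an assumption -- the real path quantizes rotated vectors against trained centroids):

```python
import numpy as np

def quantize_with_norm_correction(x, n_levels=16):
    """Quantize to the nearest of n_levels uniform codebook entries,
    then store ||x|| / ||x_hat|| so dequant can undo norm shrinkage."""
    grid = np.linspace(x.min(), x.max(), n_levels)
    idx = np.abs(x[:, None] - grid[None, :]).argmin(axis=1)
    x_hat = grid[idx]
    scale = np.linalg.norm(x) / np.linalg.norm(x_hat)
    return idx, grid, scale

def dequantize(idx, grid, scale):
    return grid[idx] * scale

rng = np.random.default_rng(0)
v = rng.standard_normal(128)
idx, grid, scale = quantize_with_norm_correction(v)
v_hat = dequantize(idx, grid, scale)

# the corrected reconstruction has exactly the original norm
assert np.isclose(np.linalg.norm(v_hat), np.linalg.norm(v))
```

One extra scalar per vector buys back the norm exactly; only the direction error from quantization remains.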
But for my specific goal -- maximum context extension -- our TBQ format at the tbq4_0/tbq3_0 operating point was actually better than TheTom's nearest equivalent. His turbo4/turbo3 landed at +2.35% PPL (above my 2% quality bar), while our tbq4_0/tbq3_0 held at +1.93%. At the aggressive 3-bit V compression point that unlocks the context gain, our format had the edge.
I documented this analysis on the upstream TBQ pull request, along with the full benchmark data and our CUDA implementation notes.
The open-source TurboQuant ecosystem -- Discussion #20969, TheTom's fork, the upstream PR, spiritbuun's CUDA fork -- is a genuine collaboration where everyone's numbers are public and reproducible. Nobody in this space is keeping secrets. The data speaks.
The Memory Management Challenge¶
One detail that the perplexity tables do not convey: running this work on the same machine that serves my production agent.
The DGX Spark has 128GB. The production llama-server with M2.5 uses ~126GB at 65K context. The test llama-server with Llama-3.1-8B (for perplexity validation) needs ~6GB of model plus KV cache. They cannot run simultaneously.
The workflow: stop the production server, start the test server, run the perplexity suite, record results, stop the test server, restart production. Each cycle takes 20-30 minutes because model loading from NVMe is slow even with mmap. During testing, my nanobot agent is offline.
mmap behavior adds complexity. Linux's page cache is aggressive -- after loading the test model via mmap, those pages compete with the production model's pages for the 128GB physical memory pool. Starting production after a test run triggers a storm of page faults as the OS evicts test model pages and re-loads production model pages. The first inference after a restart is painfully slow.
The mitigation: --no-mmap for the production server (pre-allocates all memory at startup, no page faults during inference) and mmap for the test server (faster startup, acceptable for batch perplexity runs where latency does not matter). This is the kind of systems-level detail that matters when your "GPU server" is also your desktop.
Validation: MiniMax M2.5 on WikiText-2¶
The Llama-3.1-8B numbers above are the primary benchmark (48 layers of attention, well-understood baseline). But the production model is M2.5, so I validated there too:
| Config | PPL | +/- Error | Notes |
|---|---|---|---|
| f16/f16 | 13.2093 | +/-0.751 | BASELINE |
| q4_0/q4_0 | 12.8235 | +/-0.721 | Previous production |
| tbq4_0/tbq4_0 | 12.6849 | +/-0.703 | Safe TBQ |
| tbq4_0/tbq3_0 | 12.2376 | +/-0.660 | Best TBQ |
All error bars overlap. At 4 chunks, the differences are not statistically significant -- TBQ does not degrade quality on MiniMax M2.5, but I cannot claim it is "better than f16" either. Proper validation would need 20+ chunks or downstream task evaluation. The point is: TBQ is safe for production use on this model.
Needle-in-a-Haystack: Does the Model Still Remember?¶
Context extension is pointless if the model cannot actually use the extra tokens. I ran a Needle-in-a-Haystack test: hide a specific fact in thousands of lines of filler text, then ask the model to retrieve it.
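The harness is simple to reproduce. A minimal sketch of the prompt construction -- the filler text, needle wording, and question are illustrative, not the exact strings I used:

```python
import random

def build_haystack_prompt(needle, n_filler_lines, depth=0.5, seed=0):
    """Bury `needle` at a relative depth inside filler lines,
    then append the retrieval question."""
    rng = random.Random(seed)
    filler = [
        f"Log entry {i}: routine maintenance, event code {rng.randint(1000, 9999)}."
        for i in range(n_filler_lines)
    ]
    pos = int(depth * len(filler))
    lines = filler[:pos] + [needle] + filler[pos:]
    question = "What is the secret passphrase mentioned in the text above?"
    return "\n".join(lines) + "\n\n" + question

prompt = build_haystack_prompt(
    "The secret passphrase is 'periwinkle-42'.", n_filler_lines=2000
)
```

Sweep `depth` from 0.0 to 1.0 and `n_filler_lines` up to the target context, and you cover both the position sensitivity and the length sensitivity of retrieval.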
At 5K and 10K tokens, the model found the needle immediately. At 10K with the needle placed in the middle, it found the needle but refused to extract it -- MiniMax M2.5's safety layer classified my test as a prompt injection attempt. That is the model's guardrails working correctly, not a TBQ quality issue.
At 30K+ tokens, the tests timed out. Not because the model could not find the needle, but because MiniMax M2.5's thinking mode generates unbounded reasoning chains before answering. At long context, the chain-of-thought can take 15+ minutes. A production fix would be configuring thinking budgets, but that is a model configuration question, not a KV cache compression question.
The key finding: TBQ KV cache compression preserves long-range information retrieval. The model sees and processes the same information -- it is just stored in fewer bits.
The Result: 100K Tokens¶
MiniMax M2.5, DGX Spark, 128GB unified memory. Before and after:
| Config | Context | KV Cache | Decode tok/s | PPL Impact |
|---|---|---|---|---|
| q4_0/q4_0 (before) | 65K | 4,464 MiB | 24.92 | +2.05% vs f16 |
| tbq4_0/tbq3_0 (after) | 100K | 5,492 MiB | 23.83 | +1.93% vs f16 |
100K tokens. 53% more context than the 65K baseline. KV cache uses 5,492 MiB at 100K -- more total memory than before, but per-token cost is lower (3.6 bpw vs 4.5 bpw), so you get more tokens per megabyte. Decode speed drops from 24.92 to 23.83 tok/s -- a 4% hit from the codebook lookups and rotation overhead. The server runs stable.
To put that in practical terms: 65K tokens is roughly a 50-page document. 100K tokens is roughly 75 pages. For my nanobot agent doing RAG over local documents, reading long Discord threads, or maintaining multi-day conversation history, the difference between "fits" and "does not fit" is real. And the quality is actually slightly better than before -- +1.93% vs +2.05% PPL degradation relative to f16.
Production: The Switch¶
The production server is now running with TBQ at 96K context (slightly below the 100K theoretical maximum, for safety margin):
systemctl --user status llama-minimax
llama-minimax.service - MiniMax M2.5 llama-server (TBQ KV, 96K context)
Active: active (running)
Config: --cache-type-k tbq4_0 --cache-type-v tbq3_0 --ctx-size 96000
Same port, same API, same nanobot connection. Just 48% more context window. The model answers the same questions the same way -- it just remembers more of the conversation.
The Upstream Path¶
This work lives on our fork for now. The upstream path to llama.cpp mainline requires coordination:
- PR #21089 needs to land first -- it defines the CPU-only TBQ3_0/TBQ4_0 types with QK_K=256 block format, establishing the ABI that all CUDA implementations must match.
- Our CUDA flash attention kernels would follow as a separate PR, adapted from QK=128 to QK_K=256 blocks.
- SM121-specific fixes and the rotation matrix loading would be follow-up patches.
Our fork uses QK=128 blocks (one rotation group = one attention head dimension), which is binary-incompatible with upstream's QK_K=256 (two rotation groups per block). This was a deliberate choice for SM121 -- 128-element blocks align perfectly with the 128-dimensional attention heads in both Llama and MiniMax, avoiding cross-group bookkeeping in the FA kernel. The upstream port will need to handle the 256-element block split.
What I Learned¶
The most surprising finding: 4-bit keys are non-negotiable, but 3-bit values are free. That asymmetry is the entire trick, and it took a week of wrong models, wrong corpora, and thirteen-million-PPL disasters to find it.
The engineering lessons:
- Test on the right model with the right corpus. Qwen3.5-2B with 6 attention layers and GPL license text cost me three days of chasing phantom bugs. Llama-3.1-8B with WikiText-2 is the standard for a reason.
- K precision is non-negotiable. Every attempt at 3-bit keys failed across every implementation I tested. 4-bit is the floor for keys. Values are forgiving.
- Bypass, don't debug, when time is short. The shadow cache dequant bug was never root-caused. I routed around it with native vec kernels and moved on. Sometimes the fastest path forward is not through the broken code.
- Read other people's code. TheTom's norm correction and sparse V dequant were better than what I would have built from scratch.
The DGX Spark now runs MiniMax M2.5 at 100K context, 23.83 tok/s, with better quality than the previous 65K configuration. Same $3,000 desktop. Same model. Just smarter compression.
But there is something this post does not tell you: 100K context does not mean 100K tokens at full speed. That is the subject of Part 5.
Next: Part 5: The Bandwidth Wall (and What's Next) -- The honest performance analysis. What 273 GB/s actually means for decode speed, why speculative decoding does not help, and what might change.
Mihai Chiorean is a software engineer in San Francisco. Previously CTO at Wendy Labs (edge OS on Yocto/Jetson), EM at Cash App (compliance rules engine, $100B+ txn volume), and engineer at Uber, Block/TBD, and InVision. He builds sovereign AI systems on NVIDIA hardware and contributes to TensorRT-LLM and NemoClaw.