# Sovereign AI on a Desktop, Part 5: The Bandwidth Wall (and What's Next)
Mihai Chiorean | March 2026
Series: Sovereign AI on a Desktop
- Part 1: The Stack -- What I'm running and why
- Part 2: Five Bugs in NVIDIA's Code -- Fixing TensorRT-LLM for DGX Spark
- Part 3: The Autoresearcher -- An AI agent that optimizes its own inference
- Part 4: 100K Context -- KV cache compression via TurboQuant
- Part 5: The Bandwidth Wall (and What's Next) -- What actually limits a $3,000 desktop (you are here)
Here is what the previous four posts leave unsaid if I stop there: 100K context does not mean 100K tokens at full speed.
I built a lot of things. Fixed five bugs in TRT-LLM. Automated inference tuning. Implemented TurboQuant KV compression from scratch. Got MiniMax M2.5 from 65K to 100K context on a $3,000 desktop. The numbers are real and the system runs in production. But the honest conclusion requires talking about what I cannot fix: the bandwidth wall.
## The Math
Token generation (decode) is memory-bandwidth-bound on the DGX Spark. Every generated token requires reading the active model weights from memory. For MiniMax M2.5 with ~10B active parameters at Q3_K_XL, that is roughly 4.58 GiB per token.
The Spark's memory bandwidth is 273 GB/s.
Theoretical ceiling: 273 GB/s ÷ 4.58 GiB per token (≈4.92 GB) ≈ 55-58 tokens/second
In practice, you never hit the theoretical ceiling. Memory access patterns are not perfectly sequential, there is overhead from the attention computation, KV cache reads add to the bandwidth load, and the CUDA runtime has its own costs. The measured ceiling is about 24-25 tok/s at short context, which implies roughly 40-45% bandwidth utilization -- typical for LLM decode workloads.
This ceiling is absolute. No kernel optimization, no quantization trick, no configuration change, no inference engine can break through it. The bytes have to be read from memory, and there are only 273 billion of them available per second.
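That arithmetic fits in a few lines. A quick sketch, using only the 4.58 GiB and 273 GB/s figures above plus the measured 24-25 tok/s:

```python
# Decode ceiling: every generated token streams the active weights
# (~4.58 GiB for M2.5 at Q3_K_XL) through 273 GB/s of memory bandwidth.
GIB = 2**30
bandwidth_bps = 273e9              # 273 GB/s, decimal bytes
bytes_per_token = 4.58 * GIB       # active-weight read per decoded token

ceiling = bandwidth_bps / bytes_per_token
measured = 24.5                    # observed tok/s at short context
utilization = measured / ceiling

print(f"theoretical ceiling: {ceiling:.1f} tok/s")          # ~55-58, depending on GB/GiB rounding
print(f"implied bandwidth utilization: {utilization:.0%}")  # low-to-mid 40s, percent
```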
## The Context Scaling Curve
The bandwidth wall gets worse as context grows. The attention mechanism reads the entire KV cache on every generated token. More context means more KV cache reads, which means more bandwidth consumed by attention, which means less bandwidth available for weight reads.
I measured this with a four-turn conversation, each turn requesting a ~1000-word essay:
| Turn | Accumulated Context | Decode Speed |
|---|---|---|
| 1 | 1.4K tokens | 21.8 tok/s |
| 2 | 2.6K tokens | 18.0 tok/s |
| 3 | 3.8K tokens | 16.3 tok/s |
| 4 | 4.6K tokens | 15.1 tok/s |
The first-turn drop is steep (the model is still warming up, pages are still being faulted in). The steady-state rate is about 1.5 tok/s lost per additional thousand tokens of context. This is linear and fundamental.
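A least-squares fit over the steady-state turns (excluding the warm-up first turn) recovers that slope; this is just a sanity check on the table above, not new data:

```python
# Least-squares slope of decode speed vs accumulated context,
# turns 2-4 only (turn 1 excluded as warm-up). Data from the table above.
ctx = [2.6, 3.8, 4.6]      # accumulated context, thousands of tokens
tps = [18.0, 16.3, 15.1]   # measured decode tok/s

n = len(ctx)
mx, my = sum(ctx) / n, sum(tps) / n
slope = sum((x - mx) * (y - my) for x, y in zip(ctx, tps)) / \
        sum((x - mx) ** 2 for x in ctx)
print(f"slope: {slope:.2f} tok/s per 1K tokens of context")  # ≈ -1.45
```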
Extrapolating from the bandwidth model:
| Context Length | Estimated Decode Speed | User Experience |
|---|---|---|
| 1-5K | 18-24 tok/s | Fast. Model types faster than you read. |
| 5-10K | 15-20 tok/s | Good. Slight pause noticeable but comfortable. |
| 10-30K | 8-15 tok/s | Usable. Noticeably slower, functional for work. |
| 30-60K | 3-8 tok/s | Slow. The model still works -- it just thinks visibly. |
| 60-96K | 1-4 tok/s | Available but painful. One-shot document analysis, not interactive chat. |
The old q4_0 setup hard-crashed at 65K. The new TBQ setup lets me use 96K without crashing, and the 20-30K comfortable range is genuinely more than I had before. Most of my nanobot conversations stay under 10K tokens, where the speed is indistinguishable from the old configuration.
The honest conclusion: I got 48% more context ceiling and roughly 30% more usable context, at the cost of 4% peak decode speed. For a $3,000 desktop running a 229B model, I will take that trade.
## Tensor Cores: The Real Story
I initially believed llama.cpp was leaving the Spark's 192 tensor cores completely idle. I was wrong, and the real picture is more nuanced than "tensor cores good, CUDA cores bad."
Prefill (prompt processing): Tensor cores ARE used.
- INT8 tensor core MMA instructions handle all quantized weight GEMMs (38.7% of prefill time in nsys profiling)
- FP16 tensor core MMA handles flash attention with batch > 1
- Prefill on the Spark runs at ~2,800 tok/s -- this is compute-bound, and tensor cores help
Decode (token generation): Tensor cores are NOT used, and this is correct.
- MMVQ (matrix-vector) kernels run on CUDA cores
- Vector flash attention runs on CUDA cores
- Decode is memory-bandwidth-bound at 273 GB/s -- tensor cores would not help because the bottleneck is reading weights, not computing on them
The misconception I had -- and that I see repeated in forums -- is that tensor cores are wasted on llama.cpp. They are not wasted; they are used where they matter (prefill). They are not used where they would not help (decode). The DGX Spark's 427 TFLOPS of FP4 compute is impressive on paper, but for single-user decode of a bandwidth-bound model, raw TFLOPS is not the constraint.
This is why the TRT-LLM NVFP4 path (Part 2) would not have helped decode even if the model had fit. NVFP4 uses FP4 tensor cores for the weight GEMMs, but decode is still dominated by memory reads. You would get the same ~24 tok/s at short context regardless of whether the GEMM runs on CUDA cores via INT8 MMQ or on tensor cores via NVFP4. The bandwidth wall is the same.
Where NVFP4 would help: prefill. FP4 tensor cores have ~4x the throughput of INT8 tensor cores. If the 2,800 tok/s prefill is compute-bound (not bandwidth-bound), then NVFP4 could push prefill to 5,000-10,000 tok/s. But for interactive chat, decode speed is what the user feels, and that number does not change.
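Under that compute-bound assumption, the prefill estimate is simple scaling. The efficiency factors below are my assumption (how much of the 4x survives kernel and memory overheads in practice), not measurements:

```python
# Rough prefill estimate if NVFP4 (FP4 tensor cores) replaced INT8 MMA.
# Assumes prefill stays compute-bound and FP4 has ~4x INT8 throughput.
int8_prefill = 2800   # measured tok/s on the Spark
fp4_speedup = 4.0     # FP4 vs INT8 tensor core throughput ratio

# Hypothetical efficiency range: fraction of the 4x realized in practice.
estimates = [round(int8_prefill * fp4_speedup * eff) for eff in (0.45, 0.9)]
print(estimates)  # [5040, 10080] -- roughly the 5,000-10,000 tok/s range
```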
## Speculative Decoding: Tested, No Benefit
MiniMax M2.5 ships with a built-in MTP (Multi-Token Prediction) head designed for speculative decoding. The idea: predict multiple tokens in parallel, verify them in a single forward pass, accept the ones that match. On compute-bound hardware, this can provide 2-3x speedup.
On bandwidth-bound hardware, it provides nothing.
I tested n_draft from 1 to 16 with the autoresearcher (Part 3). No configuration improved decode speed. The reason is straightforward: speculative decoding generates N candidate tokens, then verifies them by reading the model weights once for N tokens instead of N times. The savings come from reduced compute. But on the Spark, the bottleneck is the memory read, and you still have to read all the weights for the verification pass. The speculation adds overhead (generating candidates) without reducing the bottleneck (memory bandwidth).
This is counterintuitive if you are used to datacenter GPUs, where speculation is a standard optimization. On a B200 with 8 TB/s bandwidth, decode is compute-bound for large models, and speculative decoding genuinely helps. On a Spark with 273 GB/s, decode is bandwidth-bound, and speculation is all cost, no benefit.
[TODO: confirm exact n_draft values tested and results from autoresearcher JSONL logs]
## The Mac Studio Comparison
The DGX Spark's closest competitor for local LLM inference is the Mac Studio M4 Ultra:
| Spec | DGX Spark | Mac Studio M4 Ultra |
|---|---|---|
| Memory | 128 GB | 192 GB |
| Bandwidth | 273 GB/s | ~800 GB/s |
| Tensor cores | 192 (FP4: 427 TFLOPS) | N/A (Apple Neural Engine) |
| Price | $3,000 | ~$6,000-8,000 |
| Inference engine | llama.cpp, TRT-LLM | llama.cpp, MLX |
The Mac Studio has roughly 3x the memory bandwidth. For a bandwidth-bound decode workload, that translates directly: ~2-3x faster decode. M2.5 at Q3_K_XL would decode at roughly 50-70 tok/s on the Mac Studio, versus 24 tok/s on the Spark.
The Spark's advantages:
- Price. $3,000 vs $6,000-8,000 for the 192GB Mac Studio.
- CUDA ecosystem. TRT-LLM, CUTLASS, nsys profiling, the entire NVIDIA toolchain. If you want to write custom CUDA kernels (like the TBQ flash attention kernels in Part 4), the Spark is the only option.
- Tensor cores for prefill. 2,800 tok/s prefill on the Spark likely exceeds what the Mac Studio achieves, because prefill is compute-bound and INT8 MMA is fast.
The Mac Studio's advantages:
- Bandwidth. ~3x, translating to ~2-3x decode. This is the number that matters for interactive chat.
- Memory. 192GB vs 128GB. M2.5 at NVFP4 (~228GB) does not fit on either, but higher-quality quantizations have more room.
- Maturity. MLX and llama.cpp are well-optimized for Apple Silicon. The Spark's software stack is still catching up (see: five TRT-LLM bugs).
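The decode estimate is just linear scaling by bandwidth, which assumes the M4 Ultra achieves the same utilization the Spark does -- an optimistic upper bound:

```python
# Naive bandwidth-bound decode scaling: speed proportional to memory bandwidth.
spark_bw, mac_bw = 273, 800   # GB/s
spark_decode = 24             # measured tok/s at short context

mac_estimate = spark_decode * mac_bw / spark_bw
print(f"estimated Mac Studio decode: ~{mac_estimate:.0f} tok/s")  # top of the 50-70 range
```

Lower utilization on the Mac side would pull the estimate down toward the 50 tok/s end.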
[TODO: verify Mac Studio M4 Ultra pricing and actual decode tok/s for M2.5 if benchmarks are available]
For my use case -- a personal agent running 24/7, where I also want to write and test custom CUDA kernels -- the Spark is the right choice. If I only cared about decode speed and did not need CUDA, the Mac Studio would win on raw performance per dollar spent on bandwidth.
## What Might Actually Help
Given the bandwidth wall, what could meaningfully improve decode on the Spark?
### Things That Would Not Help
- Better quantization of weights. Going from Q3_K_XL (3-bit) to Q2_K (2-bit) would reduce the model size but also reduce quality significantly. The perplexity hit is too large. And it only moves the ceiling from ~58 tok/s to ~87 tok/s theoretical, ~35 tok/s practical -- a modest gain for a large quality loss.
- More tensor core utilization. Tensor cores are not the bottleneck for decode. Using FP4 tensor cores for weight GEMMs (via MXFP4 format) would not help because the weights still need to be read from memory first.
- Batching. Serving multiple users simultaneously amortizes the weight reads across requests, improving throughput per token. But I am a single user. Batch=1 is my workload.
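The quantization arithmetic is easy to check. The 2-bit byte count below is an idealized assumption (a real Q2_K model carries scale overhead, so the gain would be smaller):

```python
# Decode ceiling scales inversely with bytes per token, and bytes per token
# scale roughly with bits per weight: 3-bit -> 2-bit moves the ceiling by 3/2.
GIB = 2**30
bw = 273e9                    # bytes/s
q3_bytes = 4.58 * GIB         # measured per-token weight read at Q3_K_XL
q2_bytes = q3_bytes * 2 / 3   # idealized 2-bit variant (my assumption)

q3_ceiling = bw / q3_bytes    # ~55-58 tok/s, depending on GB/GiB rounding
q2_ceiling = bw / q2_bytes    # ~83-87 tok/s, same caveat
print(round(q3_ceiling, 1), round(q2_ceiling, 1))
```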
### Things That Might Help
- Sparse MoE loading. MiniMax M2.5 has 256 experts but only activates a handful per token. Currently, all expert weights are resident in memory. If the router's selection could be predicted one step ahead, only the needed expert weights would need to be loaded, reducing the per-token bandwidth requirement. This is an active research area. [TODO: reference MIT-109 sparse MoE loading if this is a real project identifier]
- Block-scaled tensor core KV cache. The FP4 tensor cores support block-scaled operations natively. If the KV cache were stored in a block-scaled FP4 format compatible with the MMA instruction, the attention computation could run on tensor cores during decode instead of CUDA cores. This would not help the weight-read bottleneck, but it would reduce the compute overhead of long-context attention, making the context scaling curve less steep. [TODO: reference MIT-104 block-scaled KV if this is a real project identifier]
- Hardware upgrade. The most direct path to faster decode is more bandwidth. NVIDIA's next generation of unified memory devices, or a future DGX Spark with LPDDR6 or HBM, would move the wall. Alternatively, running M2.5 across two Spark nodes with tensor parallelism would double the effective bandwidth. The 3-node DGX Spark clustering hack demonstrates that this is possible, though the networking overhead adds its own complexity.
- Better models. A future MoE model that activates fewer parameters per token (lower active-to-total ratio) would read fewer bytes from memory per token, directly improving decode speed. M2.5 activates ~10B of 229B (4.4%). A model that activated 2% would decode roughly 2x faster. Model architecture advances are arguably the highest-leverage improvement path.
## Where Things Stand
The DGX Spark, March 2026:
| What | Status | Post |
|---|---|---|
| MiniMax M2.5 inference | Running 24/7 at 96K context | Part 1 |
| TRT-LLM NVFP4 tensor core path | Bugs fixed, model does not fit | Part 2 |
| Optimal llama.cpp configuration | Found by autoresearcher | Part 3 |
| TurboQuant KV compression | +53% context, +1.93% PPL | Part 4 |
| Bandwidth wall | 273 GB/s, fundamental | This post |
The system works. My nanobot agent runs on my desk, answers questions on Discord, manages my schedule, searches the web, and remembers conversations across days. It does this on a $3,000 box with no cloud dependency. At short context (under 10K tokens), the experience is indistinguishable from a cloud API. At long context, it slows down but still functions.
Is it the fastest local inference setup? No. A Mac Studio with 3x the bandwidth would decode faster. Is it the cheapest? No. You can run smaller models on cheaper hardware. Is it sovereign? Yes. I own every layer: hardware, model, inference engine, agent, communication channels. I can read the CUTLASS source when the GEMM fails. I can write custom CUDA kernels when the KV cache needs new quantization types. I can fix bugs in NVIDIA's code and submit PRs. Nothing about this system depends on an external service staying online, remaining affordable, or choosing not to change its terms of service.
## What Comes Next
Three things I am watching:
- TRT-LLM PR merges. Four of my five PRs are still open. When they merge, every RTX 5090 and DGX Spark user gets MoE inference on TRT-LLM. For smaller MoE models that fit in NVFP4, this is the fast path.
- TurboQuant upstream. PR #21089 defines the TBQ types for llama.cpp mainline. Our CUDA implementation would follow. When this lands, anyone running llama.cpp on any Blackwell GPU gets TurboQuant KV compression without maintaining a fork.
- Qwen 3.5 400B MoE. NVIDIA is adding support (#12265, #12302). If a 400B MoE model with an active-to-total ratio similar to M2.5's ships in GGUF, and if Q2_K or Q3_K gets it under 128GB, the Spark could run something significantly more capable. The bandwidth wall still applies, but more capable per token is worth slower tokens.
The hardware is not changing. The software is still catching up. And the bandwidth wall is the honest answer to "how fast can this go": about 24 tok/s at short context on 273 GB/s, dropping linearly with context length, bounded by physics rather than engineering.
Six months ago, running a 229B model required a cloud API and a monthly bill. Now it runs on my desk at 96K context, 24 tok/s, no network connection required. The total cost: $3,000 in hardware, five weeks of engineering, and five bug fixes that NVIDIA had not gotten to yet. That is the real price of sovereign AI -- not just the box, but the willingness to read the CUTLASS source when the box does not work.
The DGX Spark is ready. The software almost is.
Mihai Chiorean is a software engineer in San Francisco. Previously CTO at Wendy Labs (edge OS on Yocto/Jetson), EM at Cash App (compliance rules engine, $100B+ txn volume), and engineer at Uber, Block/TBD, and InVision. He builds sovereign AI systems on NVIDIA hardware and contributes to TensorRT-LLM and NemoClaw.