
Sovereign AI on a Desktop, Part 1: The Stack

Mihai Chiorean | March 2026

Series: Sovereign AI on a Desktop

Part 1: The Stack -- What I'm running and why (you are here)
Part 2: Five Bugs in NVIDIA's Code -- Fixing TensorRT-LLM for DGX Spark
Part 3: The Autoresearcher -- An AI agent that optimizes its own inference
Part 4: 100K Context -- KV cache compression via TurboQuant
Part 5: The Bandwidth Wall -- What actually limits a $3,000 desktop


I run a 229-billion-parameter model on my desk. Not in the cloud. Not on a $30,000 server. On a DGX Spark -- NVIDIA's $3,000 Grace Blackwell desktop with 128GB of unified memory and a GB10 GPU.

The model is MiniMax M2.5, a Mixture-of-Experts architecture with 256 experts and 10 billion active parameters per token. It powers my personal AI agent -- a sovereign, self-hosted assistant that connects to Discord, Telegram, and Slack, runs shell commands on my machines, searches the web, and manages my schedule. No cloud API keys. No per-token billing. No data leaving my network.

This is sovereign AI in the literal sense: I own every layer of the stack, and no external entity can throttle, monitor, or revoke it.

What "Sovereign" Actually Means

People use "self-hosted" to mean "I run Ollama on a Mac." That is a start, but it is not sovereignty. Sovereignty means owning every layer -- hardware, inference engine, model weights, agent logic, communication channels, and the ability to fix any of them when they break. It means no single point of external dependency.

Here is what my stack looks like:

nanobot (Python, ~4K lines)
  |-- Discord / Telegram / Slack channels
  |-- MCP tool integration (shell, web search, scheduling)
  |-- ChromaDB vector store for persistent memory
  |-- Multi-LLM routing (local primary, API fallback)
  |-- localhost:8001 --> llama-server
                           |-- MiniMax M2.5 (229B MoE, Q3_K_XL)
                                |-- DGX Spark GB10 (128GB unified, SM121)

Every layer is mine. The model weights are on my NVMe. The inference runs on my GPU. The agent logic is my Python. The conversations stay on my machine. When two bugs in NVIDIA's stack prevented the next performance leap, I read the CUTLASS source, understood the hardware constraints, and fixed them.

That is what sovereign AI means in practice: not just running someone else's stack, but being able to fix it when it breaks.

The Agent: Nanobot

Nanobot is the reason all this hardware and optimization exists. It is a Python agent that lives on Discord, Telegram, and Slack, connecting to whichever LLM backend is running on the Spark.

What it does:

  • Chat with memory. Every conversation is stored in ChromaDB. When I ask "what was that paper I mentioned last week," it retrieves the relevant context from vector search and includes it in the prompt. This is not a toy RAG demo -- it is my actual workflow for retaining information across days and weeks.

  • Run shell commands. Nanobot has MCP tool access to run commands on my machines. I can ask it to check disk space, restart a service, look up a file, or run a build. The agent has a permission model -- some commands run immediately, others require confirmation.

  • Search the web. When the model's training data is insufficient, nanobot can search the web, read pages, and synthesize answers. It knows when to search (technical questions with recent developments) and when to rely on its own knowledge.

  • Manage my schedule. Calendar integration via MCP. It can create events, check availability, and remind me of upcoming meetings.

  • Route between models. The primary brain is MiniMax M2.5 on the Spark. But nanobot can route to cloud APIs (Claude, GPT-4) as fallback when the local model is overloaded, restarting, or when the task specifically benefits from a different model. The routing is explicit -- I can see which model answered, and the default is always local.
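
The memory loop from the first bullet can be sketched end to end. This is a self-contained stand-in, not nanobot's code: the real agent uses ChromaDB with proper embeddings, while this sketch substitutes a bag-of-words similarity so the retrieve-then-prompt step is visible.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; the real store uses ChromaDB's embeddings.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class Memory:
    def __init__(self):
        self.items: list[tuple[str, Counter]] = []

    def add(self, text: str) -> None:
        self.items.append((text, embed(text)))

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        # Rank stored snippets by similarity to the query, keep the top k.
        q = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

def build_prompt(memory: Memory, question: str) -> str:
    # Inject retrieved context ahead of the user's question.
    context = "\n".join(f"- {m}" for m in memory.retrieve(question))
    return f"Relevant past conversation:\n{context}\n\nUser: {question}"
```

Swap `embed`/`cosine` for a ChromaDB collection query and the shape of the loop is the same: store every turn, retrieve by similarity, prepend to the prompt.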

The motivation for everything that follows -- the hardware, the five TRT-LLM bug fixes, the autoresearcher, the TurboQuant KV compression -- is making this agent better. Faster responses, longer context, higher quality, always available. A personal AI that runs 24/7 on my desk, not on someone else's computer.
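
The local-first routing can be sketched like this. The names (`pick_backend`, the probe injection) are illustrative, not nanobot's actual code; the `/health` endpoint is the one llama-server exposes.

```python
import urllib.request

LOCAL_URL = "http://localhost:8001"  # llama-server from the stack above

def local_healthy(timeout: float = 2.0) -> bool:
    # llama-server's /health returns 200 once the model is loaded.
    try:
        with urllib.request.urlopen(f"{LOCAL_URL}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def pick_backend(task_needs_cloud: bool = False, probe=local_healthy) -> str:
    # Local is always the default; cloud is an explicit, visible fallback,
    # taken only when the local server is down or the task asks for it.
    if not task_needs_cloud and probe():
        return "local:minimax-m2.5"
    return "cloud:fallback"
```

Passing the probe as a parameter keeps the routing decision testable without a live server, and the returned label makes "which model answered" explicit, matching the design goal above.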

The Hardware: DGX Spark

The DGX Spark is a strange machine. It has a Grace ARM CPU and a Blackwell GB10 GPU sharing a single 128GB LPDDR5x memory pool via NVLink-C2C. There is no separate VRAM -- CPU and GPU see the same physical memory at the same addresses. This means you can load models that would never fit in a traditional GPU's dedicated memory, but you are bottlenecked by 273 GB/s memory bandwidth instead of the ~8 TB/s you would get on a datacenter B200.

The key specs:

Spec                  Value
GPU                   GB10 (Blackwell, SM121)
Memory                128 GB LPDDR5x unified
Bandwidth             273 GB/s
Tensor cores          192 (FP4: 427 TFLOPS, FP16: ~107 TFLOPS)
CPU                   Grace ARM, 20 cores
Shared memory/block   ~99 KiB (vs ~227 KiB on datacenter B200)
Price                 $3,000

The unified memory architecture is the defining feature. A datacenter B200 has 192GB of HBM3e at 8 TB/s, but it costs $30,000+ and lives in a rack. The Spark has 128GB of shared memory at 273 GB/s, and it sits on my desk. For a model like M2.5 that needs ~95GB just for weights, the Spark is the cheapest hardware that can load it at all.

The bandwidth gap is real and it dominates everything. I will talk about the bandwidth wall honestly in Part 5. But at 24 tokens/second for interactive chat, the Spark is fast enough for a personal agent. Not fast enough for production serving to hundreds of users -- but that was never the goal.

The Model: MiniMax M2.5

I chose MiniMax M2.5 because it is the best open model at its size class. It outperforms many 70B dense models on benchmarks while only activating 10 billion parameters per token -- the MoE (Mixture-of-Experts) advantage. The full model has 229 billion parameters across 256 experts, but the router selects a handful per token, keeping the computation tractable.

Key architecture details:

  • 229B total parameters, ~10B active per token. The MoE router selects experts per-token, so inference compute scales with active parameters, not total.
  • 62 layers, 48 attention heads, 8 KV heads. The GQA (Grouped Query Attention) with 8 KV heads is important -- it means the KV cache is relatively small per token, which matters enormously for context length.
  • 1M token context window. The architecture supports up to 1M tokens. In practice, memory and bandwidth limit how much you can actually use.
  • Built-in MTP (Multi-Token Prediction). The model ships with a speculative decoding head. Whether this helps on bandwidth-limited hardware is a question I will answer in Part 5.

At Q3_K_XL quantization (roughly 3 bits per weight), M2.5 fits in ~95GB. That leaves 33GB of the 128GB pool for the KV cache, llama-server overhead, the operating system, and any other processes. The margins are thin but workable.
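
Why those 8 KV heads (and KV quantization) matter is easy to quantify with back-of-envelope sizing. The head dimension of 128 is an assumption here -- I am taking it as a typical value, not a verified M2.5 spec.

```python
# Back-of-envelope KV cache sizing for M2.5 (head_dim = 128 is assumed).
LAYERS, KV_HEADS, HEAD_DIM = 62, 8, 128
CTX = 96_000  # tokens

def kv_cache_bytes(bits_per_value: float, ctx: int = CTX) -> float:
    # Two tensors (K and V) per layer, each kv_heads * head_dim values per token.
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * bits_per_value / 8
    return per_token * ctx

GIB = 1024 ** 3
print(f"f16 KV cache at 96K ctx:    {kv_cache_bytes(16) / GIB:.1f} GiB")
print(f"3.6 bpw KV cache at 96K ctx: {kv_cache_bytes(3.6) / GIB:.1f} GiB")
```

Under this assumption, an f16 cache at 96K tokens would eat most of the 33GB headroom on its own, while the 3.6 bpw tbq mix shrinks it to a few GiB -- which is what makes long context viable in the remaining memory.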

The Inference Stack: llama.cpp

The inference engine is llama.cpp, served via llama-server as a systemd service. I chose llama.cpp for three reasons:

  1. It works today. No conversion pipeline, no engine build step, no ONNX export. Download the GGUF file, point llama-server at it, get an OpenAI-compatible API on localhost:8001.

  2. It supports the quantization I need. Q3_K_XL at ~95GB is the sweet spot -- it fits in 128GB with room for KV cache. TRT-LLM's NVFP4 format would be ~228GB for M2.5, which does not fit. That is the punchline of Part 2.

  3. It already uses tensor cores. This surprised me. I initially assumed llama.cpp was pure CUDA cores, leaving the Spark's 192 tensor cores idle. Wrong. During prompt processing (prefill), llama.cpp uses INT8 tensor core MMA instructions for quantized weight GEMMs and FP16 tensor core MMA for flash attention. I confirmed this via nsys profiling -- 38.7% of prefill time is in INT8 MMA kernels. The tensor cores are only idle during single-token decode, which is memory-bandwidth-bound regardless.
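
To show what point 1 buys in practice, here is a minimal stdlib-only client for llama-server's OpenAI-compatible endpoint. The `model` field value is illustrative (llama-server largely ignores it, but the schema expects one).

```python
import json
import urllib.request

def chat_request(prompt: str,
                 url: str = "http://localhost:8001/v1/chat/completions") -> urllib.request.Request:
    # Build an OpenAI-style chat completion request for llama-server.
    payload = {
        "model": "minimax-m2.5",  # placeholder name; llama-server serves whatever is loaded
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

if __name__ == "__main__":
    with urllib.request.urlopen(chat_request("Say hello.")) as resp:
        print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

Any OpenAI-compatible client library works the same way -- nanobot just points its base URL at localhost:8001.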

The production configuration:

# /home/mihai/.config/systemd/user/llama-minimax.service
[Service]
ExecStart=/home/mihai/workspace/llama.cpp/build/bin/llama-server \
  --model /data/models/MiniMax-M2.5-UD-Q3_K_XL.gguf \
  --host 0.0.0.0 --port 8001 \
  --ctx-size 96000 \
  --cache-type-k tbq4_0 --cache-type-v tbq3_0 \
  --flash-attn on \
  --n-gpu-layers 999 \
  --threads 10

It runs 24/7. Systemd restarts it on crash. A health-check gate in the nanobot startup sequence waits for /health to return 200 before accepting Discord messages. The model takes about 10 minutes to load 95GB from NVMe into memory.
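
A minimal version of that health-check gate looks like the following. The URL matches the config above; the timings are illustrative, sized around the ~10 minute model load.

```python
import time
import urllib.request

def wait_for_health(url: str = "http://localhost:8001/health",
                    timeout_s: float = 900.0, poll_s: float = 5.0) -> bool:
    # Poll llama-server's /health until it returns 200 or we give up.
    # The 95GB model takes ~10 minutes to load, so the budget is generous.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=poll_s) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # connection refused or timed out: server not ready yet
        time.sleep(poll_s)
    return False
```

The agent startup simply refuses to accept Discord messages until this returns True, so users never hit a half-loaded model.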

The Two-Path Strategy

When I started this project, I pursued two paths simultaneously:

Path A: TensorRT-LLM with NVFP4. NVIDIA's native inference engine with hardware-accelerated FP4 tensor core compute. The promise: the Spark's 427 TFLOPS of FP4 compute could mean significantly faster inference than llama.cpp's software dequantization. The reality: five bugs in NVIDIA's code, and even after fixing all of them, the NVFP4 model at ~228GB does not fit in 128GB. The tensor core path is a dead end for this model on this hardware. That story is Part 2.

Path B: llama.cpp with GGUF quantization. The pragmatic path. Q3_K_XL at ~95GB fits. It runs at 24 tok/s. The autoresearcher found that q4_0 KV cache quantization is lossless on M2.5, unlocking 65K context. TurboQuant KV compression later pushed that to 100K. This path won -- not because it is theoretically superior, but because it fits in memory.

The lesson: on a 128GB machine, memory is the constraint that dominates everything else. Tensor core throughput does not matter if the model does not fit. The best inference engine is the one that runs.

The Numbers

Production performance as of March 2026:

Metric                        Value
Model                         MiniMax M2.5, Q3_K_XL (~95GB)
Context window                96,000 tokens
KV cache format               tbq4_0 keys / tbq3_0 values (3.6 bpw)
Decode speed (short context)  ~24 tok/s
Decode speed (10K context)    ~15 tok/s
Prefill speed                 ~2,800 tok/s
Memory usage                  ~126GB / 128GB
Uptime                        24/7 systemd service
Quality (PPL vs f16 KV)       +1.93% (within 2% bar)

24 tokens per second works out to roughly 1,400 tokens -- about 1,000 words -- per minute. For a personal agent handling one conversation at a time, this is more than fast enough. The model types faster than I can read.

The Cost Argument

The DGX Spark costs $3,000. At 24 tok/s, it generates roughly 2 million tokens per day (if running at full capacity continuously). At Claude API pricing (~$15/MTok for output), that would cost about $30/day or $900/month. The Spark pays for itself in a little over three months of heavy use, and after that, every token is free.

[TODO: verify current Claude API pricing and adjust calculation]

In practice, my usage is maybe 100K tokens per day -- not 2M. At that rate, the savings are more modest: maybe $1.50/day vs API pricing, with payback in years. The economic argument is not the primary motivation. The sovereignty argument is: my conversations do not leave my machine, my access cannot be revoked, and my agent runs whether or not I have internet.
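
The payback arithmetic from the two paragraphs above, as a tiny calculator:

```python
def payback_days(device_cost: float, tokens_per_day: float,
                 api_price_per_mtok: float = 15.0) -> float:
    # Days of use before local hardware beats paying a cloud API per token.
    daily_api_cost = tokens_per_day / 1e6 * api_price_per_mtok
    return device_cost / daily_api_cost

# Full tilt: 24 tok/s around the clock is ~2.07M tokens/day -> ~96 days.
full_tilt = payback_days(3000, 24 * 86_400)
# Realistic personal use: ~100K tokens/day -> ~2,000 days (~5.5 years).
realistic = payback_days(3000, 100_000)
```

The spread between those two numbers is the whole economic story: at saturation the hardware wins quickly, at personal-agent volumes it is sovereignty, not savings, that justifies the purchase.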

What Is Next

This post described what I am running and why. The next four posts describe how I got here:

  • Part 2 tells the story of fixing five bugs in NVIDIA's TensorRT-LLM -- and discovering that the tensor core path is a dead end for this model.
  • Part 3 describes the autoresearcher that found the optimal llama.cpp configuration, including the critical discovery that q4_0 KV cache is lossless on M2.5.
  • Part 4 is the deep dive into TurboQuant KV compression -- seven bugs, the wrong model, the wrong corpus, thirteen million perplexity, and eventually 100K context.
  • Part 5 is the honest performance analysis: what 273 GB/s actually means, why speculative decoding does not help, and what might.

Next: Part 2: Five Bugs in NVIDIA's Code -- I tried to use tensor cores. NVIDIA's code had other plans.


Mihai Chiorean is a software engineer in San Francisco. Previously CTO at Wendy Labs (edge OS on Yocto/Jetson), EM at Cash App (compliance rules engine, $100B+ txn volume), and engineer at Uber, Block/TBD, and InVision. He builds sovereign AI systems on NVIDIA hardware and contributes to TensorRT-LLM and NemoClaw.