The MacBook Won the Speed Test. My GX10 Won the Survival Test.
What a 261K-token local AI benchmark taught me about speed, context, and why tokens per second is not enough
I did not expect the result to bother me as much as it did.
I was watching IndyDevDan’s benchmark of local AI on high-end Apple Silicon, and at first the result was exactly what you would expect. The MacBook M5 Max looked fantastic. Gemma 4 26B was running at roughly 115 tokens per second, which is the kind of number that makes local AI feel less like a compromise and more like the future arriving quietly on your desk.
Then I looked at my ASUS Ascent GX10.
On my machine, models from the same family usually run closer to 55 or 60 tokens per second in normal use. That is not slow, but it is clearly not 115. So, for a moment, I had the uncomfortable feeling that many people have after buying expensive hardware: maybe I bought the wrong thing. Maybe I should have gone for the fully loaded MacBook instead. Maybe the beautiful, portable, all-in-one machine was not only more convenient, but also simply better.
But then the benchmark moved from normal prompts into larger context sizes, and the story changed.
At around 32K tokens, the MacBook run shown in the benchmark started producing errors or failed answers. That was the part that surprised me, because in my daily use of Gemma 4 26B-A4B on the GX10, I had not seen that kind of collapse. Maybe I had missed it. Maybe I had simply not pushed the model hard enough in a controlled way. Maybe it was hallucinating and I was not catching it.
So I tested it.
I pushed Gemma 4 26B-A4B on my ASUS GX10 all the way to 261,537 prompt tokens, which is very close to the model’s advertised maximum context window.
The result was not glamorous. It was not fast. It was not the kind of experience you would describe as smooth. But it was correct.
That is the part that matters.
The MacBook appeared to win the speed test. My GX10 won the survival test.
And if you care about local AI agents, long coding sessions, research workflows, tool use, or anything that lives beyond a short chat, the survival test may be the more important one.
Why this matters to me
I am not interested in local AI only because it is technically interesting, although it definitely is. My reason is more personal than that.
I have been using AI since ChatGPT came out. At first, it was a productivity tool. It helped me write faster, think faster, organise ideas, summarise things, and move through work with less friction. Then, slowly, it became something else. It stopped being just a faster way to do things I already knew how to do and started giving me access to things I could not do before.
Coding is the obvious example. I am not pretending that AI turns everyone into a senior engineer overnight, because it does not. But it gives you leverage. It lets you experiment, build prototypes, debug errors, understand unfamiliar concepts, and move through technical work that would previously have been blocked by your own limitations.
Once that happens, AI stops feeling like software.
It starts feeling like infrastructure.
That changes the emotional relationship you have with it. A tool is something you use when it is available. Infrastructure is something you build your life and work around. And once AI becomes part of how you run a business, write, code, research, create, and make decisions, it becomes uncomfortable to have all of that capability sitting entirely in someone else’s cloud.
Cloud AI is extraordinary. I use it every day and I will keep using it. For most people, most of the time, it is still the best choice. It is cheaper, easier, and far more powerful than anything they can run locally.
But I also want one system that I control.
I want something private, local, and mine. Something that does not disappear if a subscription changes. Something that does not get rate-limited in the middle of a project. Something that does not depend on a company deciding that my favourite model should be retired, filtered, repriced, or moved behind an enterprise plan.
That is why I bought the ASUS Ascent GX10. Not because it replaces cloud AI today, but because it gives me a local foundation. It gives me a machine I can learn on, test on, break, repair, configure, and trust in a different way.
The machine behind the test
The ASUS Ascent GX10 is not a normal mini PC with a large amount of memory. It is built around NVIDIA’s GB10 Grace Blackwell Superchip, the same class of platform used by NVIDIA DGX Spark systems. It has 128 GB of unified memory and is designed for local AI development, inference, experimentation, and deployment.
That detail matters because local AI is not only about the model. It is about the stack around the model.
This is one of the things that becomes obvious only after you spend time running models yourself. People talk as if the model is the whole story, but it is not. The same model can behave very differently depending on the runtime, quantisation, memory system, backend support, attention implementation, KV cache handling, chat template, and frontend.
A model running through MLX on Apple Silicon is not having the same experience as a model running through vLLM on an NVIDIA Blackwell system. A model loaded in Ollama is not necessarily behaving the same way as the same model served through a carefully configured vLLM stack. A model that fits in memory at 8K context may become unstable at 64K because the model weights were never the whole memory problem in the first place.
That is why this comparison interested me. I was not only comparing two machines. I was comparing two local AI philosophies.
The MacBook is a beautiful general-purpose computer that happens to be extremely good at local AI. The GX10 is a more specialised local AI box built around the NVIDIA ecosystem. One is more convenient. The other is closer to the kind of infrastructure stack used for serious inference.
At short context, convenience and raw speed can win. At long context, the deeper stack starts to matter.
Why Gemma 4 26B-A4B is the right model to expose this problem
The model I tested is Gemma 4 26B-A4B, and it is interesting because it sits in a very attractive part of the local AI landscape.
It is a mixture-of-experts model. In simple terms, that means the full model has around 26 billion parameters, but it does not activate all of them for every token. Instead, only around 4 billion parameters are active per token. That is why the model can feel like something larger while running more like something smaller.
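If you want to see the shape of the idea, here is a toy sketch of top-k expert routing in Python. Everything in it is illustrative: the expert count, hidden size, and router weights are stand-ins, not Gemma's actual architecture.

```python
import numpy as np

# Toy mixture-of-experts router. All sizes are illustrative,
# not Gemma 4 26B-A4B's real configuration.
NUM_EXPERTS = 32   # experts available in the layer
TOP_K = 4          # experts actually activated per token
HIDDEN = 512       # toy hidden dimension

rng = np.random.default_rng(0)
router = rng.normal(size=(HIDDEN, NUM_EXPERTS))

def route(token_vec: np.ndarray) -> list[int]:
    """Score every expert for one token, keep only the top-k."""
    logits = token_vec @ router                  # one score per expert
    return np.argsort(logits)[-TOP_K:].tolist()  # indices of the k winners

token = rng.normal(size=HIDDEN)
print(route(token))  # only these experts' parameters do any work
```

Only the selected experts' parameters participate in each token, which is why a 26-billion-parameter model can decode at something closer to a 4-billion-parameter cost.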
This is exactly the sort of design that makes local AI exciting. You get a model that is capable enough to be genuinely useful, fast enough to be practical, and efficient enough to run on machines that are powerful but still local.
The other important feature is context. Gemma 4 26B-A4B supports a very large context window, up to 256K tokens. On paper, that sounds perfect for agents. A coding agent could hold more files, more logs, more history, and more instructions. A research agent could hold more documents. A business agent could keep more project state alive without constantly summarising and forgetting.
But there is a trap hidden inside every advertised context window.
A model supporting 256K tokens does not mean your machine can use 256K tokens well. It does not mean your runtime can allocate the memory cleanly. It does not mean your KV cache is efficient enough. It does not mean the model remains accurate at that size. It does not mean the experience is usable.
There is a very big difference between theoretical context and reliable context.
That difference is what this test exposed.
The video that triggered the doubt
The benchmark I watched used live-bench, which is much more useful than a simple tokens-per-second test because it does not only ask how fast the model runs. It also asks whether the model still answers correctly as the context grows.
That is exactly the kind of benchmark local AI needs.
A model that produces 100 tokens per second but gives the wrong answer is not impressive. It is just wrong quickly. For normal chat, you might not notice immediately. For coding, research, and agentic work, you absolutely will.
In the video, the MacBook M5 Max looked excellent at normal context sizes. The short-context speed was much better than what I was seeing on the GX10. But once the prompt grew toward 32K tokens, the model began to fail the benchmark. That is when my initial jealousy turned into curiosity.
I had been using Gemma 4 26B-A4B locally as Hermes' daily runner and had not experienced that limit in the same way. So I wanted to know whether my machine was actually handling long context better, or whether I had simply not been paying enough attention.
The only way to find out was to run the test.
What happened when I pushed the GX10
The first thing I noticed was that the slowdown was gradual rather than catastrophic.
At 8K context, my GX10 generated at 39.84 tokens per second. At 16K, it was still at 37.93 tokens per second. At 32K, where the MacBook benchmark path had started showing errors, the GX10 was still generating correctly at 35.46 tokens per second. At 64K, it dropped to 31.18 tokens per second, and by 96K it was at 27.72 tokens per second.
That is a real slowdown. You can feel it. The machine is not pretending that long context is free. But this is the important part: it was still working, and it was still correct.
Then I pushed further.
At a 192K target context, the actual prompt size was 196,583 tokens. Prompt processing ran at 1,128.54 tokens per second, generation dropped to 20.35 tokens per second, and the full run took 186.97 seconds. That is no longer a snappy interaction. It is the kind of result where you become aware that you are asking a local machine to do something heavy.
But the answer was correct.
Then I pushed it close to the limit. At near maximum context, the actual prompt size was 261,537 tokens. Prompt processing ran at 954.96 tokens per second, generation dropped to 17.68 tokens per second, and the full run took 287.75 seconds.
Again, the answer was correct.
That result changed how I thought about the machine. I was no longer looking at the GX10 as simply “slower than the MacBook”. That was true only if the benchmark ended early. Once the context became large enough, the more important question was no longer speed. It was whether the model could keep the task intact.
The GX10 could.
Speed is seductive, but stability is what agents need
The easy headline would be “MacBook versus GX10”, but that is too shallow.
The MacBook is not a bad local AI machine. Quite the opposite. It may be the better machine for most people. It is portable, quiet, efficient, beautifully integrated, and very fast at normal context sizes. If your local AI use is mostly chat, short coding questions, writing help, and moderate context sizes, the MacBook experience can be excellent.
The GX10 is a different kind of machine. It is less convenient. It is more specialised. It is not a laptop. It does not give you the same all-in-one lifestyle advantage. And at short context, it may not give you the same speed.
But long-context local AI is not a sprint. It is more like an endurance race where the winner is not the runner who starts fastest, but the one who is still moving when the road gets ugly.
That is why the benchmark matters. The MacBook result looked stronger at the beginning. The GX10 looked stronger when the context became large enough to stress the whole system.
For agentic AI, that second part matters a lot.
Agents do not live in clean, short prompts. They read files. They call tools. They receive errors. They write code. They inspect logs. They revise their plan. They carry system prompts, developer instructions, user instructions, tool schemas, previous attempts, summaries, and all the strange accumulated mess of a real task.
A user may feel like they are only having a conversation. Underneath, the agent may be dragging a huge amount of state behind it.
This is why 32K tokens can disappear very quickly. It sounds like a lot when you imagine a chat window. It does not feel like a lot when an agent is inside a codebase, reading files, generating patches, running tests, and trying to remember what it already tried.
So the question is not only whether the machine can run the model. The question is whether the model remains coherent when the task becomes heavy.
That is where my GX10 result became genuinely valuable.
The hidden cost of context
Long context has two different costs, and most benchmarks do not make the difference clear enough.
The first cost is prefill. This is the part where the model reads the prompt. If you send a 100K-token prompt, the model has to process those 100K tokens before it can generate the answer. In my 192K run, the GX10 processed the prompt at 1,128.54 tokens per second. In my 261K run, it processed the prompt at 954.96 tokens per second. That is the machine reading the context.
The second cost is decode. This is the part where the model generates the answer token by token. This is the number people usually talk about when they mention tokens per second. At 8K context, my generation speed was 39.84 tokens per second. Near the maximum context, it was 17.68 tokens per second.
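You can see how those two costs combine using nothing but the numbers above. A small sketch; the output length is inferred from the leftover wall time rather than measured, so treat it as an estimate:

```python
# Split total wall time into prefill (reading) and decode (writing),
# using the measured numbers from my 261K run. The output token count
# is inferred from the leftover time, not measured.
prompt_tokens = 261_537
prefill_tps = 954.96    # prompt processing, tokens per second
decode_tps = 17.68      # generation, tokens per second
total_seconds = 287.75  # measured wall time for the full run

prefill_seconds = prompt_tokens / prefill_tps     # ~273.9 s just reading
decode_seconds = total_seconds - prefill_seconds  # ~13.9 s left for writing
est_output = decode_seconds * decode_tps          # ~245 tokens generated

print(f"prefill {prefill_seconds:.1f}s, decode {decode_seconds:.1f}s, "
      f"~{est_output:.0f} output tokens")
```

Seen that way, roughly 274 of the 288 seconds went to reading the prompt. At long context, almost the entire wait is prefill, not generation.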
That decline is not surprising. The longer the context, the more memory pressure the system has to deal with. What was surprising to me was not that the model slowed down. It was that it slowed down gracefully.
There was no sudden collapse. No obvious cliff. No moment where the model simply lost the thread and began producing nonsense. The experience became slower, but the task remained intact.
That distinction is important.
A slow correct answer is annoying.
A fast wrong answer is dangerous.
The real memory problem is not just the model
When people first get into local AI, they often think the main question is whether the model fits in memory. That is understandable. Model size is visible. You download a file, you see how many gigabytes it is, and you ask whether your machine can load it.
But with long context, the model is only part of the memory problem.
The context itself becomes expensive because of the KV cache. The KV cache stores attention information from previous tokens so the model can keep generating efficiently. As the context grows, the cache grows with it. That means a model can be perfectly comfortable at 8K tokens and then become slow, unstable, or impossible at 64K, 128K, or 256K.
This is one reason advertised context windows can be misleading. The architecture might support a huge context window, but your local stack still has to carry the memory burden of that context.
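To get a feel for the scale, here is a back-of-the-envelope estimate of KV cache size. The layer count, head count, and head dimension below are placeholders, since I have not verified Gemma 4 26B-A4B's exact configuration; the point is the shape of the growth, not the exact numbers.

```python
# Rough KV cache size: 2 (keys and values) x layers x KV heads
# x head dim x bytes per element x tokens in context.
# The architecture numbers are placeholders, not Gemma's real config.
LAYERS = 48
KV_HEADS = 8        # grouped-query attention keeps this small
HEAD_DIM = 128
BYTES_PER_ELEM = 2  # fp16/bf16 cache; an fp8 cache would halve this

def kv_cache_gib(context_tokens: int) -> float:
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM
    return context_tokens * per_token / (1024 ** 3)

for ctx in (8_192, 65_536, 262_144):
    print(f"{ctx:>7} tokens -> {kv_cache_gib(ctx):5.1f} GiB of cache")
```

With those placeholder numbers, 8K of context costs about 1.5 GiB of cache and the full 256K costs around 48 GiB, on top of the model weights. That is why an FP8 cache, which halves the bytes per element, is not a minor detail.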
Agents make this worse because they are context-expansion machines. They do not just answer. They accumulate. Every file, log, tool call, generated patch, explanation, failed attempt, and summary adds more weight.
Eventually the problem is not whether the model knows enough.
The problem is whether the system can keep holding everything without breaking.
Compaction helps, but it also creates amnesia
Most long-running AI systems eventually rely on compaction. The idea is simple: when the context gets too large, the system summarises the old conversation, keeps the summary, and discards the raw history.
This is necessary. Nobody wants to keep 200K tokens of raw conversation forever if a good summary can preserve the important state. Without compaction, long-running agents would become too expensive, too slow, or impossible to continue.
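The mechanism is simple enough to sketch in a few lines. Everything here is hypothetical: `summarise` stands in for whatever model call does the compression, and the threshold and turn counts are made up.

```python
# Minimal compaction sketch. `summarise` stands in for a model call
# that compresses old turns; the threshold and counts are hypothetical.
COMPACT_AT = 200_000  # token budget before summarising kicks in
KEEP_RECENT = 20      # recent turns kept verbatim

def compact(history: list[str], count_tokens, summarise) -> list[str]:
    if sum(count_tokens(t) for t in history) < COMPACT_AT:
        return history                   # still under budget, keep everything
    old, recent = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
    summary = summarise("\n".join(old))  # lossy: this is where memory is traded away
    return [summary] + recent
```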
But compaction is also risky.
A good summary keeps the soul of the task alive. It remembers the goal, the constraints, the files changed, the bugs found, the decisions made, the tests run, and the things that still need to happen.
A bad summary creates amnesia.
The agent forgets why it made a decision. It forgets that a bug was already fixed. It forgets that the user gave a constraint three steps ago. It forgets the architecture it was supposed to preserve. It confidently continues, but the thread has been damaged.
This is why I care about reliable long context. Even if I do not want to work interactively at 261K tokens, having that headroom gives the system more time before compaction becomes necessary. It means fewer forced summaries, less memory loss, and a better chance that the agent can keep its footing during a long session.
Long context is not only about putting more text into the prompt.
It is about delaying amnesia.
Why the GX10 result makes sense
The GX10 result is not magic. It is what you would hope to see from a machine built around this kind of workload.
The GB10 Grace Blackwell platform is designed for modern AI inference. The NVIDIA ecosystem has mature support for CUDA-based workloads. vLLM is built for serious serving rather than casual chat alone. FP4 and NVFP4 paths can reduce the memory footprint of the model. FP8 KV cache can help reduce the memory pressure created by long context. MoE backend support matters because Gemma 4 26B-A4B is a mixture-of-experts model and those experts need to be handled correctly.
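For what it is worth, here is roughly what that kind of setup looks like through vLLM's Python API. The model ID is a placeholder, and whether these exact settings load and run will depend on your vLLM version, quantisation, and hardware.

```python
from vllm import LLM, SamplingParams

# Sketch of a long-context vLLM setup. The model ID is a placeholder,
# and fitting an fp8 cache plus a 256K window depends on your build.
llm = LLM(
    model="google/gemma-4-26b-a4b",  # placeholder model ID
    max_model_len=262_144,           # reserve the full context window
    kv_cache_dtype="fp8",            # reduce KV cache memory pressure
    gpu_memory_utilization=0.90,
)

outputs = llm.generate(
    ["Summarise the key decisions in the log below:\n..."],
    SamplingParams(max_tokens=512, temperature=0.2),
)
print(outputs[0].outputs[0].text)
```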
These details are not glamorous, but they decide the outcome.
By contrast, Apple Silicon is excellent hardware and MLX is genuinely impressive. I do not want this to sound like a dismissal of the MacBook, because that would be stupid. Apple’s machines are extraordinary for local AI considering their form factor, power efficiency, and general-purpose usefulness.
But for this specific workload, Gemma 4 26B-A4B at very long context, the GX10/DGX Spark-style stack appears to have an advantage. It is not that the model becomes smarter on NVIDIA hardware. It is that the full stack seems better able to keep the model stable as context grows.
That may change. MLX will improve. Frontends will improve. Apple Silicon inference will continue to evolve. A future version of this test might look different.
But today, with my setup and this model, the GX10 behaved more like the machine I hoped I was buying.
The conclusion is not “GX10 beats MacBook”
This is where the comparison needs to be honest.
If you ask which machine is better, the only serious answer is: better for what?
If you want one beautiful machine that you can carry around, use as your main computer, edit video on, write on, code on, and also run local AI at high speed, a high-end MacBook Pro is still incredibly compelling. For many people, it is the correct choice.
If you want a specialised local AI box for testing models, running agents, experimenting with inference stacks, pushing context windows, and learning how AI infrastructure behaves under pressure, the GX10 makes more sense.
The MacBook gives you convenience and short-context speed.
The GX10 gives you endurance and stack alignment.
That is the real distinction.
I do not regret buying the GX10 because the thing I care about most is not only the first answer in a conversation. I care about what happens after the context is full of work, mistakes, logs, files, instructions, and state.
That is where the GX10 earned my confidence.
The dangerous lie of context windows
The biggest lesson from this test is that context windows need to be treated with suspicion.
When a model says it supports 256K tokens, that is useful information, but it is not the full truth. It tells you what the model architecture can theoretically support. It does not tell you what your machine can load, what your runtime can handle, what your memory system can sustain, what your KV cache will cost, or whether the model will still answer correctly.
There are really several context windows hiding inside the one number.
There is the theoretical context window, which is what the model claims to support. There is the loadable context window, which is what your runtime can allocate without crashing. There is the usable context window, which is what runs at a tolerable speed. And then there is the reliable context window, which is what still produces correct answers.
Most marketing talks about the first one.
Most users care about the third one.
Agents need the fourth one.
That is why I think the 261K result matters. Not because 17.68 tokens per second is enjoyable. It is not. It matters because the model was still correct at the very edge of the context window.
That tells me the system has headroom. And headroom is what makes a machine feel dependable.
How local AI benchmarks should change
I think local AI benchmarking needs to mature.
Tokens per second is useful, but it is only the beginning. It tells you how quickly the model speaks under a particular set of conditions. It does not tell you whether the model remains coherent after a long session, whether tool calls still work, whether the runtime handles memory pressure cleanly, or whether the model starts giving plausible but wrong answers once the context becomes heavy.
A better benchmark should test how the system behaves as the prompt grows. It should measure prefill speed, decode speed, wall time, memory use, correctness, stability, and agent behaviour. It should not only ask whether the model responded. It should ask whether the model understood.
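As a starting point, even a simple probe like the one below would say more than a raw tokens-per-second figure. It assumes a local OpenAI-compatible server at a placeholder URL and model name, and its token counts are rough.

```python
import time
from openai import OpenAI

# Growing-context correctness probe against a local OpenAI-compatible
# server. The base URL and model name are assumptions about your setup,
# and the filler-based token counts are only approximate.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
NEEDLE = "The secret code is 7401."

for target in (8_000, 32_000, 96_000, 192_000):
    half = "Nothing important happened today. " * (target // 12)
    prompt = half + NEEDLE + " " + half + "\n\nWhat is the secret code?"
    start = time.time()
    reply = client.chat.completions.create(
        model="local-model",  # whatever name your server exposes
        messages=[{"role": "user", "content": prompt}],
        max_tokens=32,
    )
    answer = reply.choices[0].message.content or ""
    verdict = "correct" if "7401" in answer else "WRONG"
    print(f"~{target:>6} tokens: {verdict} in {time.time() - start:.1f}s")
```

A real harness would also log prefill and decode speed separately, track memory use, and run tasks harder than needle retrieval, but even this much separates a model that answers from a model that merely responds.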
This is especially important for local AI because the entire point of local AI is control. If we only measure the first few seconds of a clean prompt, we are not testing the thing we actually want to own. We are testing a demo.
Real work is messy. Real agents accumulate context. Real coding sessions produce logs, errors, patches, reversals, and half-finished ideas. Real research involves conflicting sources and long chains of evidence. Real business workflows require memory of decisions and constraints.
So the benchmark should become messy too.
That is the only way to know whether a local model is a toy or a tool.
Why I still believe local AI matters
The rational argument against local AI is strong.
Cloud AI is better for most people. It is cheaper, easier, more powerful, and constantly updated. For a relatively small monthly subscription, you can access models that would be impossible to run on your own hardware.
So if someone only uses AI casually, I would not tell them to buy a GX10. That would make no sense.
But if AI is becoming part of your work, your business, your research, your coding, your creative process, or your ability to build, then the calculation changes. At that point, local AI is not just about saving money. It is about having a private fallback, a learning environment, an experimentation platform, and a piece of infrastructure you control.
That is why I think of it as digital preparedness.
Not because local AI replaces cloud AI today. It does not.
Because it gives you something that cloud AI cannot give you: ownership of the machine, ownership of the stack, and the ability to keep working even when the cloud changes around you.
That matters to me.
What I learned
Before running this test, I was not completely sure I had made the right choice. The MacBook numbers were impressive, and the convenience of having all that power in a laptop is hard to ignore.
But after pushing Gemma 4 26B-A4B to 261,537 tokens on the GX10 and still getting the correct answer, I feel differently.
The MacBook gave the better first impression.
The GX10 gave me more confidence under pressure.
That is the distinction I care about.
The next phase of local AI will not only be about chatting with a model. It will be about agents that read codebases, manage projects, call tools, remember decisions, debug problems, and keep working across long sessions. Those workflows do not live comfortably at 4K or 8K tokens. They burn through context, expose memory problems, punish weak runtimes, and turn theoretical context into real context.
So my conclusion is simple: do not judge local AI only by how fast it starts. Judge it by how long it stays useful.
On my GX10, Gemma 4 26B-A4B was slow near the limit, but still right.
And in local AI, that may be the difference between a toy and a tool.
If you enjoyed this article, join our community on Discord. https://discord.gg/MRESQnf4R4
Transparency note: This article was written and reasoned by Manolo Remiddi. The Resonant Augmentor (AI) assisted with research, editing and clarity. The image was also AI-generated.