Any idea why the Mac with 128 GB of memory, same as the GX10, would struggle with the larger context window? Is the model implementation somehow different for the MLX environment versus CUDA?
Yes, I think the key point is that “128 GB unified memory” does not mean the two systems handle long context the same way.
The Mac is not weak here. Apple’s M5 Max has up to 614 GB/s memory bandwidth, while DGX Spark/GX10 is listed around 273 GB/s. So this is probably not just a raw memory-bandwidth issue.
The bigger difference is the inference stack.
Gemma 4 26B-A4B has a 256K context window, but at large context the KV cache becomes the hard part. On the Spark/GX10 side, the successful setups use CUDA/vLLM with things like NVFP4 weights, FP8 KV cache, and MoE-specific backend support. That stack seems better tuned for this exact long-context workload. At the moment on my GX10 I'm using llama.cpp, which seems to be even more stable than vLLM.
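To give a rough sense of why the KV cache dominates at 256K, here's a back-of-envelope sketch in Python. The layer count, KV-head count, and head dimension are placeholder assumptions for illustration, not the real Gemma 4 26B-A4B config:

```python
# Rough KV-cache sizing; the model dimensions below are assumed, not real.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem):
    # keys + values, hence the factor of 2
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

ctx = 256 * 1024  # 256K-token context
fp16 = kv_cache_bytes(48, 8, 128, ctx, 2)  # 16-bit KV cache
fp8 = kv_cache_bytes(48, 8, 128, ctx, 1)   # FP8 KV cache halves it
print(f"~{fp16 / 2**30:.0f} GiB at FP16 vs ~{fp8 / 2**30:.0f} GiB at FP8")
```

With those assumed numbers a full 256K cache lands in the tens of gigabytes on top of the weights, which is why FP8 KV cache and proper backend support for it matter so much here.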
On Mac, the MLX path may run very fast at shorter context, but model format, KV-cache handling, attention implementation, and frontend/server behaviour can differ a lot.
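For comparison, a minimal sketch of the MLX path via mlx_lm, assuming the model has already been converted to an MLX-compatible format; the repo id is a placeholder:

```python
# Minimal mlx_lm sketch; the model path is a placeholder, not a real repo.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/placeholder-model-4bit")
text = generate(model, tokenizer, prompt="Summarise this long document: ...",
                max_tokens=512)
print(text)
```

The call looks the same whether the prompt is 4K or 200K tokens, but how the KV cache is allocated and quantized underneath is entirely up to the MLX stack, and that's where the long-context behaviour can diverge from the CUDA/vLLM path.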
So my interpretation is: the Mac wins on short-context speed, but the GX10/Spark stack currently appears more reliable for Gemma 4 26B-A4B at very large context.
Software will get better. oMLX is also worth checking, but I can already tell you that the problem isn't solved there either.