Oh my goodness, this model feels like it was crafted specifically for Agents!

At 5 a.m., I glanced at my phone and discovered that Qwen3 had just been released. I thought about going back to sleep, but the excitement kept me awake. I knew I had to dive in and test it all day long.

When I first laid eyes on Qwen3-30B-A3B, one thought immediately struck me: this could very well be a model designed with Agents in mind.

Right now, there are two major challenges in building Agents: sustaining continuous tool invocation across many rounds, and keeping token consumption and latency under control.

I started by testing its throughput, and the results were jaw-dropping! I used the FP8 version for these tests. As shown in Figure 2, I ran tests at 16 and 32 concurrent requests, simulating a typical Agent scenario with a lengthy system prompt. With an input length of 8192 tokens and the output capped at 1024 tokens, it sustained an impressive aggregate output throughput of 388 t/s.

Keep in mind, this performance was achieved on just a single 4090 GPU. It seems that even a small company’s internal Agent requirements could be met with just one or two 4090s using this model.
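For anyone who wants to reproduce a similar concurrency test, here is a rough sketch of how such a run could be driven against an OpenAI-compatible endpoint (vLLM-style). The endpoint URL, model name, and prompt sizing below are placeholder assumptions, not my exact setup:

```python
# Hedged sketch: aggregate output throughput under N concurrent requests
# against an OpenAI-compatible /v1/completions endpoint (e.g. a local vLLM).
# ENDPOINT and MODEL are assumptions, not the exact setup from this post.
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

ENDPOINT = "http://localhost:8000/v1/completions"  # assumed local server
MODEL = "Qwen3-30B-A3B-FP8"                        # assumed served model name

def one_request(prompt: str, max_tokens: int) -> int:
    """Send one completion request; return the number of output tokens."""
    body = json.dumps({"model": MODEL, "prompt": prompt,
                       "max_tokens": max_tokens}).encode()
    req = urllib.request.Request(
        ENDPOINT, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["usage"]["completion_tokens"]

def aggregate_throughput(token_counts, wall_seconds):
    """Total output tokens per second, summed across concurrent requests."""
    return sum(token_counts) / wall_seconds

def run_benchmark(concurrency: int = 16) -> float:
    """Fire `concurrency` requests at once and report aggregate output t/s."""
    prompt = "x " * 4000  # crude stand-in for a long (~8k-token) system prompt
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        counts = list(pool.map(lambda _: one_request(prompt, 1024),
                               range(concurrency)))
    return aggregate_throughput(counts, time.monotonic() - start)
```

Calling `run_benchmark(16)` or `run_benchmark(32)` against a live server gives the aggregate figure; the 388 t/s above is from my own run, not something this sketch guarantees.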

Next, I tested it in my own scenario. My C++ problem-solving assistant is nearly ready for testing. It uses the MCP protocol to drive an interactive notebook: you pose a C++ question, and it generates a step-by-step executable code notebook, which greatly improves learning efficiency, especially for children.

This scenario involves multiple rounds of tool invocation: the AI repeatedly plans, executes tools, and analyzes the results. The test results were nothing short of perfect! In each round, the model carefully evaluates the current task status and decides the next step. This model has clearly been trained for multi-round tasks, and I’ve been waiting for something like this for quite a while.
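The control flow behind this kind of test is simple to sketch. Below, `call_model` and `run_tool` are hypothetical stand-ins for a real chat-completions client and an MCP tool executor; this shows the shape of the loop, not my actual implementation:

```python
def agent_loop(call_model, run_tool, messages, max_rounds=8):
    """Drive tool calls until the model returns a plain answer.

    call_model(messages) is assumed to return either
    {"tool": name, "args": ...} or {"answer": text}.
    """
    for _ in range(max_rounds):
        reply = call_model(messages)
        if "answer" in reply:
            return reply["answer"]           # model is done reasoning
        result = run_tool(reply["tool"], reply["args"])
        # Feed the tool result back so the model can plan the next round.
        messages.append({"role": "tool", "content": result})
    raise RuntimeError("no final answer within max_rounds")

# Scripted stand-in model: one tool call, then a final answer.
_script = iter([
    {"tool": "run_cpp", "args": "int main() { return 0; }"},
    {"answer": "The program compiled and ran successfully."},
])
final = agent_loop(lambda msgs: next(_script),
                   lambda tool, args: "exit code 0",
                   messages=[{"role": "user", "content": "Run this snippet."}])
```

The key property the model needs, and what impressed me here, is staying coherent as that `messages` list grows across many rounds.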

What truly astonished me, though, was the single-user throughput: over 100 t/s. As you can see in my final image, even after consuming 50,000 tokens across multiple rounds of tool calls, the output stayed lightning-fast, and the entire run finished in just 40 seconds.
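For context on how a number like that can be read: time the streamed output tokens and exclude the time-to-first-token, so prefilling the long context doesn’t distort the decode rate. A minimal helper along those lines (my own naming, not any library’s API):

```python
import time

class SpeedMeter:
    """Count streamed tokens and report decode speed, excluding prefill."""

    def __init__(self):
        self.first = None
        self.last = None
        self.tokens = 0

    def tick(self, now=None):
        """Call once per streamed token; `now` is injectable for testing."""
        now = time.monotonic() if now is None else now
        if self.first is None:
            self.first = now       # time-to-first-token ends here
        self.last = now
        self.tokens += 1

    def tokens_per_second(self):
        """Decode rate over the interval between first and last token."""
        elapsed = self.last - self.first
        return (self.tokens - 1) / elapsed if elapsed > 0 else float("inf")
```

Hooking `tick()` into a streaming client’s per-token callback is all that’s needed to get a decode-speed readout like the one above.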

Could the future look like this: an MoE with a small activated-parameter count, solving both the speed and cost problems while still delivering the capability to tackle the hardest Agent workloads?

The wait was absolutely worth it. In the coming days, I’ll keep running in-depth tests of these models in my C++ learning Agent scenario. As for the 200B+ model, I believe it may be feasible to run it on a single machine with some tuning via KTransformers. Is anyone interested? If so, I’d be happy to test further.