Huawei Cloud Computing Power Reaches Another Major Breakthrough!

At the recently concluded Huawei Connect 2025 conference, a series of new advances were announced.

This comes just half a year after the formal release of the CloudMatrix384 super-node in April 2025, during which its capabilities have continued to evolve:

In April 2025, the CloudMatrix384 super-node was released and put into commercial use, scaling up at the Wuhu data center; in June 2025, the new generation of Ascend AI cloud services based on the CloudMatrix384 super-node was fully launched; in September 2025, the Tokens service was fully connected to the CloudMatrix384 super-node, shielding users from the complexity of the underlying technology and directly delivering the final AI computing results.

Currently, the AI industry remains gripped by anxiety over computing power, and Silicon Valley giants have been especially active around computing power and chips recently:

OpenAI is developing its own AI chips while proposing a $300 billion computing-power deal to Oracle; Musk built a 100,000-card supercomputing cluster in around a hundred days and plans to scale to a million cards, while quietly laying out his own chip plans; Meta, AWS, and others are also actively seeking more computing resources. However, computing power is not built overnight: it requires ultimate breakthroughs in individual technologies as well as the coordinated evolution of chips, hardware, architecture, software, networking, energy, and the entire industrial ecosystem.

Looking globally, the suppliers capable of delivering powerful computing power all rely on decades of accumulation and refinement.

Huawei Cloud, as one of these suppliers, faces an exploration path made especially demanding by the stage the industry is at: it needs to redefine the rules of how computing power is operated in the AI era.

The mainstay of general-purpose computing is the Kunpeng cloud services, built on Huawei Cloud's self-developed Kunpeng processors (ARM architecture). They provide a family of cloud service products for general computing scenarios and drive intelligent innovation across industries. Kunpeng cloud services have achieved comprehensive software-hardware co-innovation: from multi-core, high-concurrency chip design and the integrated 'Qingtian' architecture to deep optimization of Huawei Cloud's intelligent scheduling platform and operating system, they deliver powerful computing power 'out of the box'. The number of Kunpeng cores on the cloud has grown from over 9 million to 15 million, an increase of roughly 67%. At the same time, they are fully compatible with mainstream application software, with more than 25,000 applications adapted, providing solid support for a thriving ARM ecosystem. This is the general architecture of Huawei Cloud's 'computing black soil'. On this foundation, Huawei Cloud can upgrade in a targeted way according to the needs of AI deployment in the era of large models, providing the industry with more efficient, easier-to-use, and more reliable computing power.

In the AI era, defining computation with Tokens

The demands of AI led Huawei Cloud to officially launch the Tokens service, based on the CloudMatrix384 super-node, this year. This is a cloud service model oriented to AI large-model inference scenarios that charges by actual Token consumption. Unlike traditional cloud computing billing, it can significantly reduce the cost of AI inference. The adjustment is grounded in detailed insight into how large models are actually being deployed.

Tokens are the units into which text is converted for a large model to process, and the content throughput of large models is measured in Tokens, making them the natural unit of measurement in the era of large models. As AI deployment has progressed, Token consumption has grown exponentially. Data show that the average daily Token consumption in China was about 100 billion in early 2024; by the end of June this year it had exceeded 30 trillion, an increase of more than 300 times in just over a year and a half. Clearly, Tokens are no longer merely a technical unit of calculation: they represent the actual consumption of large models, serve as a key reference for measuring large-model deployment, and directly reflect the computing power, memory, and compute time used behind them.

Using Tokens as the billing unit is gradually becoming an industry consensus. On the one hand, it measures the resources an enterprise actually uses more precisely, so users pay only for real consumption and can further optimize costs by understanding how those costs are composed. On the other hand, it resolves the unfairness that arises when Token consumption varies widely across scenarios, and it gives cloud vendors a reference for dynamically adjusting computing resources.
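To make pay-per-Token billing concrete, here is a minimal sketch of how a bill could be computed from actual consumption. The rates, token counts, and function names are illustrative assumptions, not Huawei Cloud's published pricing.

```python
# Illustrative sketch of pay-per-Token billing.
# All prices and usage figures are hypothetical, not Huawei Cloud's actual rates.

def token_cost(input_tokens: int, output_tokens: int,
               price_in_per_m: float = 2.0, price_out_per_m: float = 8.0) -> float:
    """Cost in currency units, with separate rates per million input/output Tokens."""
    return (input_tokens / 1e6) * price_in_per_m + (output_tokens / 1e6) * price_out_per_m

# A bursty office scenario: heavy use by day, almost silent at night.
daytime_calls = 5_000            # requests during working hours (assumed)
per_call = (1_200, 400)          # (input, output) Tokens per request (assumed)

daily_cost = daytime_calls * token_cost(*per_call)
print(f"pay-per-Token daily cost: {daily_cost:.2f} currency units")

# Time-based or per-card billing would also charge for the idle night hours,
# which is why pay-as-you-go fits this usage pattern better.
```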
Consider, for example, online, near-line, and offline scenarios. Long-text generation tasks are typical of daily office work, with heavy use during the day and almost none at night, so pay-as-you-go billing is more reasonable than time-based or per-card billing. In scenarios such as intelligent customer service and AI assistants, the number and depth of conversation rounds vary unpredictably, and the Tokens service model can account for the cost of each interaction more precisely.

Token services also effectively shield users from complex underlying implementations. Users do not need to care about the chip's process node, the server generation, or other layers of the hardware stack, nor about inference frameworks, model deployment, and the rest of the software stack; they directly and efficiently obtain the 'final result of AI'.

At the HC2025 conference, Huawei Cloud announced the full launch of the CloudMatrix384 AI Token inference service. This marks a new stage for AI computing power characterized by 'extreme performance and efficiency', with performance exceeding that of NVIDIA's H20 by 3 to 4 times. The underlying technology rests mainly on the CloudMatrix384 super-node and the xDeepServe distributed inference framework.

First, the CloudMatrix384 super-node uses a fully peer-to-peer interconnection architecture and high-speed communication technology, giving it significant advantages in compute and communication and allowing it to release more of its raw computing power. It uses Huawei Cloud's self-developed MatrixLink high-speed peer-to-peer interconnection network to tightly couple 384 Ascend NPUs and 192 Kunpeng CPUs into a logically unified super 'AI server'. Through scale-out, it can also be composed into AI clusters of more than 160,000 cards, supporting the training of 1.3-trillion-parameter large models or the inference of tens of thousands of models. In the future, based on Huawei's latest AI server roadmap, the CloudMatrix super-node will be further upgraded to an 8,192-card specification, forming million-card AI clusters.

Second, following the concept that 'everything can be pooled', Huawei Cloud's innovative EMS elastic memory storage service decouples NPU memory, CPU memory, and storage resources into a unified resource pool. NPUs can directly access the pooled memory remotely, so memory can be scaled independently, significantly reducing the latency of multi-round conversation Tokens. At the same time, computing, storage, and network resources can be dynamically combined according to load, improving resource utilization.

This matters most in multi-round question-and-answer scenarios. When a large model handles multi-round dialogue, responses usually slow down as the rounds accumulate, because the model must 'remember' the data generated in every previous round to keep its answers coherent; as the rounds grow, the amount of computation multiplies and responses are delayed. The EMS service effectively relieves this problem.
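A rough back-of-the-envelope sketch shows why multi-round dialogue strains memory: the KV cache the model must retain grows linearly with the accumulated context. The model dimensions below are illustrative assumptions, not the specification of any particular model running on Ascend.

```python
# Back-of-the-envelope estimate of KV-cache growth across dialogue rounds.
# All model dimensions are illustrative assumptions.

LAYERS = 64      # transformer layers (assumed)
KV_HEADS = 8     # key/value heads (assumed, grouped-query attention)
HEAD_DIM = 128   # per-head dimension (assumed)
BYTES = 2        # fp16/bf16 element size

def kv_cache_bytes(total_tokens: int) -> int:
    """KV cache for one sequence: 2 (K and V) * layers * heads * head_dim * tokens * bytes."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * total_tokens * BYTES

tokens_per_round = 1_000  # context added per dialogue round (assumed)
for rounds in (1, 10, 50):
    size_gib = kv_cache_bytes(rounds * tokens_per_round) / 1024**3
    print(f"{rounds:3d} rounds -> ~{size_gib:.2f} GiB of KV cache")

# The cache grows with every round, so it quickly outgrows on-chip NPU memory;
# pooling memory (as EMS does) lets it spill over without recomputing earlier rounds.
```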
Third, PDC separation and dynamic PD. PDC (Prefill-Decode-Caching) separation pins Prefill and Decode to their own clusters and uses the global addressing capability of the MatrixLink high-speed peer-to-peer interconnection network to build an independent KV cache cluster. Whether it is the Prefill cluster or the Decode cluster, the NPUs can directly access the shared memory cache of that independent KV cache cluster, breaking the physical-location constraints on data and significantly improving load balancing, NPU utilization, and memory utilization while providing greater elasticity. The system can also analyze or predict the inference load accurately and in real time.

Fourth, the CloudMatrix384 super-node is designed specifically for the mainstream MoE architecture, supporting a 'one card, one expert' distributed inference mode that spreads the expert modules of an MoE model across different NPU cards for parallel processing (a toy sketch of this placement appears at the end of this section). For example, 256 experts map to 256 cards, which cuts communication latency and the computing power wasted waiting on it, reduces the weight-loading latency and weight memory footprint on each card, and significantly increases the number of parallel paths per card.

With the Tokens service fully connected to the CloudMatrix384 super-node, enterprise users across industries can obtain the 'final AI computing result' they need with optimal performance, good service, and high quality, and can focus more on application and business innovation.
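As a simplified illustration of the 'one card, one expert' idea mentioned above, the toy sketch below places each expert's weights on a single card and routes tokens to the card hosting their selected expert. It is not the xDeepServe implementation, and all numbers are assumptions.

```python
# Toy sketch of "one card, one expert" MoE placement: each expert's weights live
# on exactly one card, keeping per-card weight memory small, and tokens are routed
# to the card hosting their selected expert. Illustration only, not xDeepServe.

NUM_EXPERTS = 256
NUM_CARDS = 256                              # one expert per card (assumed)

expert_to_card = {e: e % NUM_CARDS for e in range(NUM_EXPERTS)}

def route(expert_id: int) -> int:
    """Return the card that holds the chosen expert's weights."""
    return expert_to_card[expert_id]

# A batch of (token_id, expert_id) pairs produced by the MoE gating network (assumed).
batch = [(0, 17), (1, 17), (2, 200), (3, 42)]
by_card: dict[int, list[int]] = {}
for token_id, expert_id in batch:
    by_card.setdefault(route(expert_id), []).append(token_id)

for card, tokens in sorted(by_card.items()):
    print(f"card {card}: tokens {tokens}")   # cards process their experts in parallel
```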


By WMCN
