Huawei CloudMatrix AI performance has reached what the company claims is a significant milestone, with internal testing showing its new data centre architecture outperforming Nvidia’s H800 graphics processing units when running DeepSeek’s advanced R1 artificial intelligence model, according to a comprehensive technical paper released this week by Huawei researchers.
The research, conducted by Huawei Technologies in collaboration with Chinese AI infrastructure startup SiliconFlow, provides what appears to be the first detailed public disclosure of performance metrics for CloudMatrix384.
However, it’s important to note that the benchmarks were conducted by Huawei on its systems, raising questions about independent verification of the claimed performance advantages over established industry standards.
The paper describes CloudMatrix384 as a “next-generation AI datacentre architecture that embodies Huawei’s vision for reshaping the foundation of AI infrastructure.” While the technical achievements outlined appear impressive, the lack of third-party validation means the results should be viewed in the context of Huawei’s continuing efforts to demonstrate technological competitiveness under US sanctions.
The CloudMatrix384 architecture
CloudMatrix384 integrates 384 Ascend 910C NPUs and 192 Kunpeng CPUs in a supernode, connected by an ultra-high-bandwidth, low-latency Unified Bus (UB).
Unlike traditional hierarchical designs, a peer-to-peer architecture enables what Huawei calls “direct all-to-all communication,” allowing compute, memory, and network resources to be pooled dynamically and scaled independently.
The system’s design addresses notable challenges in creating modern AI infrastructure, particularly for mixture-of-experts (MoE) architectures and distributed key-value cache access, considered essential for large language model operations.
Performance claims: The numbers in context
The Huawei CloudMatrix AI performance results, while conducted internally, present impressive metrics on the system’s capabilities. To understand the numbers, it’s helpful to think of AI processing like a conversation: the “prefill” phase is when an AI reads and ‘understands’ a question, while the “decode” phase is when it generates its response, word by word.
According to the company’s testing, CloudMatrix-Infer achieves a prefill throughput of 6,688 tokens per second per processing unit, and a decode throughput of 1,943 tokens per second per unit when generating responses.
Think of tokens as individual pieces of text – roughly equivalent to words or parts of words that the AI processes. For context, this means the system can process thousands of words per second on each chip.
The “TPOT” measurement (time-per-output-token) of under 50 milliseconds means the system generates each word in its response in less than a twentieth of a second – creating remarkably fast response times.
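Taken together, the decode throughput and TPOT figures imply a heavily batched workload. A back-of-envelope sketch (the per-stream and concurrency figures below are derived from the article’s numbers, not stated in the paper):

```python
# Back-of-envelope arithmetic from the reported figures.
# A TPOT of 50 ms means one individual response stream emits
# 1000 / 50 = 20 tokens per second.
tpot_ms = 50
per_stream_tps = 1000 / tpot_ms          # tokens/s for a single request

# The reported 1,943 tokens/s per NPU during decode therefore implies
# many requests being served concurrently on each chip.
decode_tps_per_npu = 1943
implied_concurrent_requests = decode_tps_per_npu / per_stream_tps

print(per_stream_tps)               # 20.0
print(implied_concurrent_requests)  # ~97
```

The concurrency estimate is only illustrative, but it shows why per-chip throughput and per-request latency are reported as separate metrics.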
More significantly, Huawei’s results translate into what it claims are superior efficiency figures compared with competing systems. The company measures this through “compute efficiency” – essentially, how much useful work each chip accomplishes relative to its theoretical maximum processing power.
Huawei claims its system achieves 4.45 tokens per second per TFLOPS for reading questions and 1.29 tokens per second per TFLOPS for generating answers. For perspective, TFLOPS (trillion floating-point operations per second) measures raw computational power – akin to the horsepower rating of a car.
Huawei’s efficiency claims suggest its system does more useful AI work per unit of computational horsepower than Nvidia’s competing H100 and H800 processors.
The company reports maintaining 538 tokens per second under the stricter timing requirement of sub-15 milliseconds per output token.
However, the impressive numbers lack independent third-party verification, which is standard practice for validating performance claims in the technology industry.
Technical innovations behind the claims
The reported Huawei CloudMatrix AI performance metrics stem from several techniques described in the research paper. The system implements what Huawei calls a “peer-to-peer serving architecture” that disaggregates the inference workflow into three subsystems – prefill, decode, and caching – enabling each component to scale based on workload demands.
The paper highlights three innovations: a peer-to-peer serving architecture with disaggregated resource pools; large-scale expert parallelism supporting up to an EP320 configuration, in which each NPU die hosts one expert; and hardware-aware optimisations including optimised operators, microbatch-based pipelining, and INT8 quantisation.
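The disaggregation idea can be illustrated with a minimal sketch: prefill processes the whole prompt once and writes attention state into a shared cache pool, while decode reads that state back and emits tokens one at a time, so the two stages can be scaled independently. All names below are illustrative stand-ins, not Huawei’s implementation:

```python
# Toy sketch of a disaggregated prefill/decode/cache serving flow.
# Illustrative only -- loosely modelled on the paper's description.

class KVCachePool:
    """Shared key-value cache pool reachable by both stages."""
    def __init__(self):
        self.store = {}

    def put(self, request_id, kv):
        self.store[request_id] = kv

    def get(self, request_id):
        return self.store[request_id]

def prefill(request_id, prompt_tokens, cache):
    # Prefill stage: process the full prompt once, persist KV state.
    kv = [f"kv({t})" for t in prompt_tokens]  # stand-in for attention state
    cache.put(request_id, kv)
    return len(kv)

def decode(request_id, cache, max_new_tokens):
    # Decode stage: read KV state, emit tokens one at a time
    # (actual token generation is stubbed out here).
    kv = cache.get(request_id)
    assert kv, "prefill must run before decode"
    return [f"tok{i}" for i in range(max_new_tokens)]

cache = KVCachePool()
n = prefill("req-1", ["Why", "is", "the", "sky", "blue", "?"], cache)
out = decode("req-1", cache, max_new_tokens=3)
```

Because the cache pool sits between the stages, a deployment can add prefill capacity for long prompts or decode capacity for long answers without resizing the other stage.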
Geopolitical context and strategic implications
The performance claims emerge against the backdrop of intensifying US-China tech tensions. Huawei founder Ren Zhengfei acknowledged recently that the company’s chips still lag behind US competitors “by a generation,” but said clustering methods can achieve comparable performance to the world’s most advanced systems.
Nvidia CEO Jensen Huang appeared to validate this during a recent CNBC interview, stating: “AI is a parallel problem, so if each one of the computers is not capable… just add more computers… in China, [where] they have plenty of energy, they’ll just use more chips.”
Lead researcher Zuo Pengfei, part of Huawei’s “Genius Youth” program, framed the research’s strategic importance, writing that the paper aims “to build confidence in the domestic technology ecosystem in using Chinese-developed NPUs to outperform Nvidia’s GPUs.”
Questions of verification and industry impact
Beyond the performance metrics, Huawei reports that INT8 quantisation maintains model accuracy comparable to the official DeepSeek-R1 API across 16 benchmarks – though, again, in internal, unverified tests.
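INT8 quantisation itself is a standard technique: values are rescaled into the signed 8-bit range and expanded back at compute time, trading a small amount of precision for roughly half the memory traffic of 16-bit formats. A generic symmetric round-trip (not Huawei’s exact scheme, which the paper does not fully specify) looks like this:

```python
# Generic symmetric INT8 quantisation round-trip -- illustrative only.

def quantize_int8(values):
    # Map the largest magnitude onto 127; everything else scales linearly.
    scale = max(abs(v) for v in values) / 127.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    # Expand the 8-bit integers back to approximate real values.
    return [v * scale for v in q]

weights = [0.8, -1.27, 0.003, 0.5]
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, recovered))
```

The benchmark comparisons in the paper are essentially a claim that this kind of rounding error stays small enough not to degrade the model’s answers.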
The AI and technology industries will likely await independent verification of Huawei’s CloudMatrix AI performance before drawing definitive conclusions.
Nevertheless, the technical approaches described suggest genuine innovation in AI infrastructure design, offering insights for the industry, regardless of the specific performance numbers.
Huawei’s claims – whether validated or not – highlight the intensity of competition in AI hardware and the varying approaches companies take to achieve computational efficiency.