Shanghai’s summer of 2024 was stiflingly humid, deep in the rainy season, but for the domestic AI industry it brought an event as electrifying as a rock concert: the World Artificial Intelligence Conference (WAIC).
At the conference, Alibaba’s Tongyi Qianwen, Zhipu AI’s foundational large model, and SenseTime’s Vimi controllable character video generation model, each a “crown jewel” of its company, captivated audiences with demonstrations of their impressive AGI capabilities. But beyond the front-end AI models that dazzled the crowds, the backbone of domestic AI, the computing power provided by homegrown chips, also took center stage. Numerous Chinese AI chip companies, including Biren Technology, Moore Threads, Enflame Technology, NationalChip, Tianshu Zhixin, and others, showcased their product lines in quick succession. These offerings covered everything from training to inference, spanned both edge and cloud, and included general-purpose GPUs as well as AI acceleration cards tailored for various scenarios. The message was clear: they intended to give Nvidia a little “shock from China.”
Wandering through the domestic computing power pavilion, taking in the comprehensive product lines and impressive single-card specifications, an attendee’s intuitive impression was that the industry is vibrant and, compared with its counterparts across the ocean, not lacking much. Yet the daily news of tightening U.S. sanctions made it hard to sustain the conclusion that domestic AI chips are in a period of explosive growth. The current “prosperity” raises an obvious question: does it rest on a solid foundation?
Looking back at the fraught history of China’s drive for chip self-sufficiency, from Loongson’s and Feiteng’s early attempts to challenge the Wintel alliance to the Kirin chip’s battle at Songshan Lake, the industry’s focus has always been on devices’ core processors. As a result, CPU-class cores have been treated by both the government and investors as the key breakthrough point, attracting government procurement orders and substantial funding. The return of Kirin in 2023 was a powerful response by domestic chips to external blockades.
However, while breakthroughs were being made on the processor front, the GPU, once considered a side battlefield, suddenly became the main stage. With the emergence of AI large models in 2023, the demand for GPUs, which power these models, skyrocketed. Nvidia’s revenue grew by 125% in 2023, and its first-quarter report for 2024 showed a staggering 262% growth, leaving other chip giants far behind.
In comparison, the combined market value of Intel, the chip king of the computer era, and Qualcomm, the king of the mobile internet era, is just over $300 billion, less than one-eighth of Nvidia’s. The new king has been crowned by the surging demand for AI training.
But the awkward truth is that in this AI wave, the U.S. has no intention of giving China a first-class ticket. At the U.S. government’s behest, Nvidia and AMD are allowed to supply only lower-end products such as the H20 “China-specific” GPU, while access to high-end models like the A100 and H100 has been cut off. Next to the real thing, the “China-specific” version feels like a stopgap. According to evaluations by some tech media, the H20’s overall computing power is equivalent to only about 20% of the H100’s, and because more hardware is needed to deliver the same compute, the cost per unit of computing power rises significantly.
In this semi-blockaded situation, collaboration between domestic large models and domestic AI chips becomes the logical next step. China’s strong demand for computing power centers also offers a huge market for domestic GPUs: by the end of 2023, China’s data centers housed more than 8.1 million racks with a total computing power of 230 EFLOPS, making it the second-largest computing power nation after the United States.
Thus, there are tangible examples of domestic chips being deployed in data centers:
- Biren Technology has become a computing power partner of China Telecom. Meanwhile, China Mobile’s intelligent computing center in Hohhot, equipped with Biren’s general-purpose GPU computing products, has recently come online. The center is a national-level “N-node” large-scale training ground with a single-site computing power of 6.7 EFLOPS (FP16), lending weight to Biren’s claim that its products can support thousand-card clusters and scale to tens of thousands of cards.
- Moore Threads has launched a complete solution around its AI flagship product KUAE, including the KUAE cluster management platform (KUAE Platform) and the KUAE large model service platform (KUAE ModelStudio), which address how to keep large numbers of interconnected high-performance computing cards running stably and how to allocate computing resources efficiently in large-scale data centers. The company has also signed agreements for large-scale cluster projects in the Qinghai Zero Carbon Industrial Park, the Qinghai Plateau, and Guangxi ASEAN.
Beyond the cloud, meeting the demands of AI large models at the edge is also a focus for many AI chip companies. Enflame Technology, for example, has partnered with Zhipu AI to launch a large model programming assistant appliance based on the CloudFlame i20 inference acceleration card, providing AIGC functions such as code generation, code translation, code annotation, code completion, and intelligent Q&A for software development enterprises. Muxi Technology has paired its N100 card with Mouro AI to release the first AI “texture super-resolution” technology. Baidu’s Kunlun chip unit has further optimized its hardware to support Wenxin Yiyan.
Another significant factor is that the domestic capital market has provided substantial support for the GPU industry’s development. For instance, at the end of 2023, both Moore Threads and Biren Technology completed single rounds of financing exceeding 2 billion yuan, while Muxi Technology easily raised 1 billion yuan in a single round. This is nothing short of a miracle in a capital market that had hit rock bottom in 2023.
In short, domestic GPU players are backing domestic computing power centers and large models on both the hardware-adaptation front and the software-ecosystem front, which gives them the confidence to face the international giants. But is everything as smooth as it seems?
Behind the noise of the press releases, both computing power centers and large model companies are scrambling to secure Nvidia GPUs. In 2023 alone, Nvidia’s revenue in China reached 80.6 billion yuan, while the financial results of domestic GPU makers were dismal.
Jingjiawei, the first domestic GPU stock on the A-share market, posted only 108 million yuan in revenue in the first quarter of 2024, a year-on-year increase of 66.27% but still less than a third of its revenue in the same period of 2022. Cambricon, the “champion of computing power” long reported to outperform Nvidia, recorded first-quarter revenue of just 25 million yuan, a year-on-year decline of 65%. CloudWalk, which pivoted from AI applications to AI chips, reported total chip revenue of only 24 million yuan for 2023.
Revenue at chip companies in the primary market is even more opaque. Tianshu Zhixin, for example, which claims to be the first Chinese company to truly mass-produce general-purpose GPUs, disclosed total revenue of 250 million yuan for 2022. Some companies whose valuations approach tens of billions, or even hundreds of billions, of yuan announce partnerships and order agreements almost daily, yet their actual revenue from delivered products barely reaches tens of millions.
It can be said that under all the hype, most “strategic partnerships” and “strategic agreements” are more demonstrative than substantial.
Chinese industry must face a reality: competing with Nvidia on paper specifications alone means little. The stable, continuous, and efficient operation of billion-parameter large models and data centers with tens of thousands of cards has never been a single-dimensional task, nor one that can be accomplished overnight.
In fact, even the simplest evaluation of a chip’s fitness for large model work involves at least five dimensions, as the rough sketch after this list illustrates:
- Single-card performance
- Inter-card connectivity
- Cluster utilization
- Support for large model training
- Compatibility with existing ecosystems
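To make the argument concrete, here is a purely illustrative Python sketch (the vendor name and scores are invented, not drawn from any real benchmark): a buyer’s decision behaves less like the average of these five dimensions than like their minimum, because the weakest dimension caps what a cluster can actually deliver.

```python
# Hypothetical illustration -- all names and scores below are invented.
# The point: an accelerator is judged by its weakest of the five dimensions,
# not by the average of its strengths.

DIMENSIONS = [
    "single-card performance",
    "inter-card connectivity",
    "cluster utilization",
    "large-model training support",
    "ecosystem compatibility",
]

def bottleneck_score(scores: dict) -> float:
    """The weakest dimension limits what a buyer can actually run."""
    return min(scores[d] for d in DIMENSIONS)

# Invented example: a strong single card held back by a weak software ecosystem.
vendor_a = {
    "single-card performance": 0.9,
    "inter-card connectivity": 0.6,
    "cluster utilization": 0.5,
    "large-model training support": 0.6,
    "ecosystem compatibility": 0.2,
}

print(f"average score:    {sum(vendor_a.values()) / len(vendor_a):.2f}")  # looks respectable
print(f"bottleneck score: {bottleneck_score(vendor_a):.2f}")              # what buyers actually feel
```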
Each domestic GPU may have its individual highlights. Huawei’s single-card performance, for example, may be no worse than Nvidia’s; Muxi’s graphics cards have exceeded industry expectations in compatibility with existing ecosystems; and Baidu’s Kunlun chips show clear advantages in supporting the training of large models like Wenxin Yiyan. But only Nvidia has managed to be strong on all five fronts, and a domestic competitor that falls short on even one finds it hard to land deals.
For instance, one of Nvidia’s recognized moats is its CUDA ecosystem. Without CUDA, most programmers wouldn’t even know how to develop on a GPU hardware platform, as its software ecosystem has permeated all aspects of AI and scientific research. Baidu’s former chief scientist, Andrew Ng, once commented: before CUDA, there were probably fewer than 100 people in the world who could program on GPUs, but now there are millions of CUDA developers globally.
This is thanks to Nvidia’s early support for the CUDA system’s development and promotion in the AI field starting in 2006. At the time, Nvidia invested $500 million annually in R&D to continuously update and maintain CUDA, when its annual revenue was only a modest $3 billion. Concurrently, Nvidia made CUDA available to universities and research institutions in the U.S. for free, quickly establishing CUDA’s dominance in the AI and general computing fields.
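To see what that lock-in looks like in practice, here is a minimal sketch, assuming only that PyTorch is installed (the function name and matrix size are arbitrary choices for illustration). For most AI developers, “programming a GPU” amounts to handing a device string to a CUDA-first framework, and that framework layer is exactly what any alternative chip has to slot in beneath.

```python
# Minimal sketch of everyday "GPU programming" through the CUDA-centric ecosystem.
# Assumes PyTorch is installed; matmul_benchmark and the matrix size are illustrative.
import time

import torch

def matmul_benchmark(n: int = 4096) -> float:
    """Multiply two random n x n matrices and return the elapsed time in seconds."""
    # The developer never writes a GPU kernel -- choosing a device string is enough.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)

    if device == "cuda":
        torch.cuda.synchronize()  # finish setup before timing
    start = time.perf_counter()
    c = a @ b                     # dispatched to a CUDA kernel when a GPU is present
    if device == "cuda":
        torch.cuda.synchronize()  # GPU work is asynchronous; wait for it to complete
    return time.perf_counter() - start

if __name__ == "__main__":
    backend = "CUDA" if torch.cuda.is_available() else "CPU"
    print(f"{matmul_benchmark():.4f} s on {backend}")
```

If no CUDA device is available, the same code silently falls back to the CPU and runs far more slowly, which is the practical meaning of “compatibility with existing ecosystems” in the list above.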
In the field of large model support, Nvidia has long been ahead of everyone else. Few people know that after Nvidia spent a huge sum to build the world’s first AI supercomputer, DGX-1, in 2016, it immediately donated it to OpenAI, then still in its infancy, establishing deep ties with the large model ecosystem early on.
In high-performance interconnects, NVLink is far ahead of the competition, leaving even AMD, the other American graphics card giant, trailing behind. It is well known that GPU computing power does not simply add up: without good interconnect technology, 1+1 may fall short of 2, and it is doubtful whether 10+10 even reaches 15.
While other manufacturers are still constrained by traditional PCIe, Nvidia has been laying the groundwork for over a decade. As early as 2014 it introduced NVLink 1.0, which achieved roughly five times the transfer speed of PCIe 3.0 between the P100 GPUs of that era. In 2020, Nvidia completed its acquisition of Mellanox, gaining InfiniBand, Ethernet, SmartNIC, DPU, and LinkX interconnect capabilities and further strengthening its position. Today, NVLink can deliver up to 600 GB/s of bandwidth per GPU, roughly ten times that of PCIe 4.0.
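A back-of-the-envelope sketch shows why this matters, using the 600 GB/s NVLink figure above and treating PCIe 4.0 as roughly a tenth of it; the payload, the FP16 gradients of a GPT-3-scale 175-billion-parameter model, is an illustrative assumption rather than a measured workload.

```python
# Back-of-the-envelope arithmetic: interconnect bandwidth dominates cluster scaling.
# NVLink figure (~600 GB/s per GPU) follows the text above; PCIe 4.0 is taken as
# roughly one tenth of that. The payload size is an illustrative assumption.

GRADIENT_BYTES = 175e9 * 2  # FP16 gradients of a GPT-3-scale (175B-parameter) model

LINKS = {
    "NVLink (~600 GB/s)": 600e9,
    "PCIe 4.0 (~1/10th)": 60e9,
}

for name, bandwidth_bytes_per_s in LINKS.items():
    seconds = GRADIENT_BYTES / bandwidth_bytes_per_s
    print(f"{name}: {seconds:4.1f} s to move one full set of gradients")

# If gradients must be exchanged every training step, a link that is 10x slower can
# erase much of the benefit of adding cards -- the sense in which 1 + 1 falls short of 2.
```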
Hence, in the eyes of some observers, Nvidia resembles a “three-headed dragon”: powerful GPU computing, a rich software ecosystem, and high-speed interconnect bandwidth combine into a well-defended product line that is difficult to breach. Trying to bypass its ecosystem can leave you with thousands of cards delivering the performance of hundreds, or discovering halfway through development that the tools you need are nowhere to be found. Such losses are unacceptable for AI computing centers built on massive investment, and an unbearable burden for large model developers already facing extensive engineering and optimization work.
What’s even more striking is that Nvidia is still racing down the path of cost reduction for customers.
Jensen Huang has a famous line for customers: “The more you buy, the more you save,” known as Huang’s math. In the context of AI large models, it is about driving down the hardware cost of training and token generation. In June this year, Nvidia’s Blackwell-architecture GB200 was shown to reduce cost and energy consumption to 1/25th of the H100’s, and on the 175-billion-parameter GPT-3 LLM benchmark its performance reached seven times that of the H100 with four times the training speed, making the chip a compelling price-performance proposition despite its roughly $70,000 price tag.
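A rough sketch of the arithmetic behind “Huang’s math”: the roughly $70,000 GB200 price and the 4x training-speed figure come from the paragraph above, while the H100 reference price is an assumption used purely for illustration.

```python
# Illustrative arithmetic only. GB200 price (~$70,000) and 4x training speed follow
# the text above; the H100 reference price is an assumed figure -- adjust it to taste.

def dollars_per_unit_of_training(price_usd: float, relative_speed: float) -> float:
    """Price divided by training throughput, normalized so the H100 = 1.0."""
    return price_usd / relative_speed

h100_cost = dollars_per_unit_of_training(price_usd=30_000, relative_speed=1.0)   # assumed price
gb200_cost = dollars_per_unit_of_training(price_usd=70_000, relative_speed=4.0)  # figures above

print(f"H100 : ${h100_cost:,.0f} per unit of training throughput (assumed $30k card)")
print(f"GB200: ${gb200_cost:,.0f} per unit of training throughput")

# Under these assumptions the pricier chip is still the cheaper way to train -- the
# treadmill that any competing GPU vendor has to keep pace with.
```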
It can be said that building billion-parameter large models and ten-thousand-card data centers on hardware ecosystems and interconnects that have not been tested by time and real deployments is like erecting a skyscraper without surveying the ground beneath it. Trying to run domestic large models entirely on domestic GPUs would push the costs of domestic large model companies to unsustainable levels.
Hence the industry’s somewhat helpless reality: plenty of noise, limited actual deployment.
Talk of a “quick victory” is unwise, but China need not slide into a narrative of “quick defeat” either. Even a player as formidable as Nvidia cannot win every battle.