Chinese academician: China has the potential to surpass the United States in the "Large Model+" field
On December 12, the "Breaking New Frontiers, Embracing a Smart Future" 2024 Large-Scale Model Technology and Application Innovation Forum was held in Beijing. During the event, Zheng Weimin, an academician of the Chinese Academy of Engineering and a professor in the Department of Computer Science at Tsinghua University, delivered a speech titled “Research and Reflections on Computer Systems Supporting Large Model Training and Inference.” In his speech, he stated that China has the potential to surpass the United States in the “Large Model+” arena.
He noted that the current difficulties in large model development lie in computing power, storage, and time costs. Although building a domestic "10,000-card system" (a high-performance computing system with 10,000 or more accelerator cards, such as GPUs, TPUs, and other specialized AI chips) is indeed important, significant challenges remain, and the "weakest-link effect" (or "bucket effect") must be avoided. He cautioned that companies with limited funds should not pursue one for now, while those with ample funds might give it a try.
“Large Model+” May Surpass the United States
Zheng pointed out two characteristics of large model development this year. First, foundational large models have become multimodal, extending beyond text to images and video. Second, these models are now being put to practical use: "Large Model+" is being applied across sectors such as finance, healthcare, automotive, and intelligent manufacturing.
He believes that while China's foundational models may lag slightly behind those in the United States, in the realm of "Large Model+" (applying these models across different fields), China still has hope of surpassing the U.S.
Challenges in Large Model Development
Zheng outlined five stages in the large model lifecycle—data acquisition, data preprocessing, model training, model fine-tuning, and model inference—and used them to illustrate the current challenges in large model development.
Data Acquisition
The core task is to collect training data from all over the world. Although each file may be small, the total count reaches tens of billions. These files must be stored on hard drives and their locations recorded; in other words, the metadata must be managed.
Because of the enormous number of files, many machines must work together to store and track these locations, which is itself a challenge. And as the number of recorded locations grows, looking up a specific file becomes slower, so efficient storage and retrieval of this information is a key issue at this stage.
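To make the idea concrete, the sketch below (illustrative only, not from the speech) spreads file-location records across several index servers by hashing the file path, so that no single machine has to hold all tens of billions of entries; the class and names are hypothetical.

```python
import hashlib

class ShardedFileIndex:
    """Toy metadata index: maps file path -> storage location across N shards."""

    def __init__(self, num_shards):
        # Each shard stands in for one metadata server's lookup table.
        self.shards = [dict() for _ in range(num_shards)]

    def _shard_for(self, path):
        # Hash the path so records spread evenly across the shards.
        h = int(hashlib.md5(path.encode("utf-8")).hexdigest(), 16)
        return self.shards[h % len(self.shards)]

    def record(self, path, location):
        self._shard_for(path)[path] = location

    def lookup(self, path):
        return self._shard_for(path).get(path)

# Usage: record where a crawled file was stored, then find it again later.
index = ShardedFileIndex(num_shards=8)
index.record("/crawl/2024/news/0001.txt", "storage-node-42:/vol3/0001.txt")
print(index.lookup("/crawl/2024/news/0001.txt"))
```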
Data Preprocessing
The collected data varies widely in quality and format, containing advertisements, duplicates, and other unwanted content. Preprocessing improves data quality by removing duplicates, advertisements, and so on; higher-quality data yields better training results.
Preprocessing is extremely complex. For GPT-4, preprocessing reportedly took up half of the total training time and became a bottleneck. How to speed up preprocessing is therefore a major challenge in big-data processing.
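As a rough illustration of one such step, the snippet below (illustrative only, not taken from the speech) drops exact duplicates by hashing each document and keeping only the first occurrence:

```python
import hashlib

def deduplicate(documents):
    """Yield each document once, skipping exact duplicates (toy example)."""
    seen = set()
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc

docs = ["Buy now! Ad text.", "A useful article.", "A useful article."]
print(list(deduplicate(docs)))  # the duplicated article appears only once
```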
Model Training
Model training requires massive computing power and storage to produce a foundational large model, and numerous problems arise during this phase. For example, if hardware fails mid-training, the run would otherwise have to start over from the beginning. To avoid this, one can pause training at intervals and record the current state of the hardware and software environment (a checkpoint), so that after a failure training can resume from the last recorded state rather than from scratch.
However, for large models, writing this state to disk can take hours, which is itself a source of inefficiency. The challenge is how to shorten the save to 10–20 minutes so as to improve overall training efficiency.
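The pause-and-record approach he describes is ordinary checkpointing. A minimal sketch in PyTorch-style Python (the framework and names are assumptions for illustration, not from the speech) looks like this; the hard part he points to is making the save itself fast enough:

```python
import torch

def save_checkpoint(model, optimizer, step, path="checkpoint.pt"):
    # Persist everything needed to resume: weights, optimizer state, progress.
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "step": step,
    }, path)

def load_checkpoint(model, optimizer, path="checkpoint.pt"):
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]  # resume from this step instead of step 0

# Inside the training loop (sketch): checkpoint every k steps so a hardware
# failure loses at most k steps of work rather than the whole run.
# if step % 1000 == 0:
#     save_checkpoint(model, optimizer, step)
```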
Model Fine-Tuning
Even after the foundational model is trained, it must be further trained (fine-tuned) for specific domains, such as healthcare. Fine-tuning uses domain-specific data to adapt the foundational model to specialized needs.
For example, if the foundational model has seen too little hospital data, it can be fine-tuned with additional hospital data. This can be refined further, for example with a third stage of training on ultrasound data alone, yielding specialized "domain-specific" or "industry-specific" large models.
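In code, this staged specialization amounts to running additional training passes on progressively narrower data. A highly simplified sketch follows, with a generic train_on helper and hypothetical dataset names standing in for real pipelines:

```python
import torch
from torch import nn

def train_on(model, dataset, epochs=1, lr=1e-5):
    """Generic fine-tuning pass: a few epochs of gradient descent on `dataset`."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for inputs, labels in dataset:       # dataset yields (inputs, labels) batches
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), labels)
            loss.backward()
            optimizer.step()
    return model

# Hypothetical staged specialization (names are placeholders, not real datasets):
# model = load_foundation_model()              # stage 1: general pretraining (already done)
# model = train_on(model, hospital_records)    # stage 2: adapt to healthcare data
# model = train_on(model, ultrasound_reports)  # stage 3: specialize further on ultrasound data
```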
Model Inference
The inference stage is where the model is actually put to use. Inference also requires significant computing power, storage, and time; indeed, throughout the entire development process, computing power, storage, and time costs are constant considerations.
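A back-of-the-envelope calculation shows why inference alone is storage-hungry: serving a single long request from a hypothetical 70B-class model already needs on the order of 10 GiB just for the attention key/value cache (all numbers below are assumptions for illustration, not figures from the speech):

```python
# Rough memory estimate for the attention key/value cache when serving one request.
num_layers  = 80      # transformer layers (assumed)
hidden_size = 8192    # model width (assumed)
seq_len     = 4096    # tokens kept in context (assumed)
bytes_fp16  = 2       # half-precision storage per value

# Each layer caches one key vector and one value vector per token.
kv_cache_bytes = 2 * num_layers * hidden_size * seq_len * bytes_fp16
print(f"KV cache for a single 4k-token request: {kv_cache_bytes / 2**30:.1f} GiB")  # ~10 GiB
```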
Avoiding the “Bucket Effect” When Building Domestic 10,000-Card Systems
In terms of industry expectations, Academician Zheng emphasized that building a domestic 10,000-card system is crucial. However, current training effectiveness with remote and heterogeneous cards is poor. He advised companies with limited funds not to consider it for now, while companies with ample resources could try.
First, while the importance of a 10,000-card system is self-evident, achieving it is extremely challenging. Due to external supply constraints, establishing a homegrown 10,000-card system is urgently needed but is a difficult task. For a 10,000-card system to be considered “good,” it must be widely regarded as useful once built, a highly challenging goal.
How well do users fare with such mixed card systems today? If the first card comes from Company A, the second from Company B, and the third from Company C, then when they are used together the overall performance is determined by the weakest card. Zheng suggests using fewer cards and doing more in-depth research in order to avoid this "bucket effect." For instance, mixing 1,000 old CPUs with 1,000 new CPUs might perform worse than 2,000 old CPUs alone, so why do it?
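The "bucket effect" follows from how synchronous training works: every card must finish its share of a step before the next step can begin, so the step time is set by the slowest card, and heterogeneity adds coordination overhead on top. A toy calculation with made-up numbers:

```python
# Toy throughput model: in synchronous data-parallel training, a step ends only
# when the slowest card finishes, so per-step time = max(per-card times).
old_card_step = 2.0   # seconds per step on an older card (made-up number)
new_card_step = 1.0   # seconds per step on a newer card (made-up number)
mix_overhead  = 0.3   # extra cost of coordinating mixed hardware/software stacks (assumed)

# 2,000 identical old cards: every card takes the same time per step.
homogeneous_step = old_card_step

# 1,000 old + 1,000 new cards: each step still waits for the old cards,
# plus the overhead of running a heterogeneous setup.
mixed_step = max(old_card_step, new_card_step) + mix_overhead

print(f"2,000 old cards:        {homogeneous_step:.1f} s/step")
print(f"1,000 old + 1,000 new:  {mixed_step:.1f} s/step (worse, despite the newer hardware)")
```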
Second, remote and heterogeneous card training currently yields poor results and is not recommended. Combining different types of accelerator cards is complicated and not cost-effective. Even under static conditions, neither Chinese nor American developers do this.
As for joint training across different geographical locations, it becomes even harder. For example, transferring data from Beijing to Guizhou might take five days, and after processing, sending it from Guizhou to Shanghai might take another five days. This makes remote card training impractical.
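The transfer times are easy to sanity-check: at wide-area network speeds, moving a training-scale dataset between cities takes days, not hours (the dataset size and link speed below are assumptions, not figures from the speech):

```python
# Purely illustrative: time to move a training dataset over a wide-area link.
dataset_bytes = 1 * 10**15        # assume a 1 PB training corpus (made-up size)
link_bytes_per_sec = 10e9 / 8     # assume a dedicated 10 Gbps long-haul link
seconds = dataset_bytes / link_bytes_per_sec
print(f"~{seconds / 86_400:.1f} days just to move the data one way")  # ≈ 9 days
```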
Hence, for companies with limited funds, it’s not advisable to consider these solutions for now. Those with ample funding may attempt them.