The release of OpenAI's o1 has once again sparked discussions within the industry about the new paradigm of large model evolution.
The discussion centers on two widely recognized bottlenecks in the evolution of large models: the data bottleneck, meaning high-quality training data is running short, and the compute bottleneck, with clusters of roughly 32,000 GPUs widely seen as the current practical upper limit.
However, the o1 model appears to have found a new path. By employing reinforcement learning, it seeks to overcome these limitations through deeper thinking and reasoning, aiming to improve both data quality and computational efficiency.
Yang Zhilin, the founder of Moonshot AI, has shared some profound insights on whether this new paradigm could propel large model competition into a new phase.
On September 14th, Yang Zhilin gave a talk at the Xuanyuan College of Tianjin University. His key points were:
After the scaling law, the next paradigm for large model development is reinforcement learning.
The release of the OpenAI o1 model attempts to break through the data wall via reinforcement learning, and there is a growing trend of shifting more computation toward the inference (reasoning) stage.
The upper limit of this generation of AI technology is determined by the upper limit of text model capabilities.
In the AI era, the capability of a product is determined by the capability of its model, which is fundamentally different from the internet era—if the model isn't strong, the product experience will not be good.
The super app of the AI era will most likely be an AI assistant.
At the end of the talk, Yang Zhilin quoted “Thinking, Fast and Slow” author Daniel Kahneman, saying:
"Many times, you have the courage to attempt something unknown because you don't realize how much you don't know. Once you try, you discover many new problems, and perhaps that's the essence of innovation."
Full transcript of Yang Zhilin’s speech:
Today, I mainly want to share some thoughts on the development of the artificial intelligence (AI) industry.
The AI field has been developing for over seventy years and has gone through various stages of progress. From 2000 to 2020, AI was primarily focused on vertical domains, and many companies emerged in areas like facial recognition and autonomous driving. However, these companies were largely focused on specific tasks, developing AI systems tailored for particular purposes.
These systems were labor-intensive and highly customized. This was the core paradigm of AI at that time: "You reap what you sow. If you want a watermelon, you plant a watermelon; you can't expect to plant one thing and get something else."
This paradigm has changed significantly in recent years. The focus is no longer on training highly specialized AI models but on developing general intelligence.
What are the benefits of general intelligence? The same model can be applied across different industries and tasks, making it highly generalizable, which opens up vast potential.
If we can achieve human-level performance in many areas, it may have a leveraging effect on the global economy, potentially boosting GDP. This is because individual productivity could increase dramatically. A task that once required one person's output might now be handled by general AI, resulting in a productivity multiplier, possibly doubling or even increasing tenfold, depending on how far general intelligence evolves.
The Three Factors Behind the Emergence of General Models
Why have general models emerged so suddenly in recent years? I believe it is both an inevitability and a coincidence. It is inevitable in the sense that the advancement of human technology was bound to reach this point eventually.
However, it is coincidental because three specific factors came together:
**First**, the internet has been developing for over 20 years, providing a massive amount of training data for AI. The internet essentially digitized the world and human thoughts, allowing everyone to generate data. The ideas in people's minds eventually became vast amounts of data.
This is quite a coincidence because when people started building internet products like search engines or portal websites in the early 2000s, they probably never imagined that this data would one day contribute to the next generation of technological advancements for human civilization. In the grand scheme of technological development, the internet serves as a prerequisite node for AI.
**Second**, many advancements in computer technology also serve as prerequisites for AI. For example, achieving 10^25 FLOPs (floating-point operations) is necessary to develop sufficiently intelligent models.
However, performing such a massive number of operations simultaneously within a single cluster in a manageable time frame was not possible ten years ago.
This progress depended on advancements in chip technology and network technology. Not only did chips need to become faster, but they also needed to be interconnected, with sufficient bandwidth and storage. All of these technologies had to come together to enable computations of 10^25 FLOPs within two to three months.
If it took two or three years to reach 10^25 FLOPs, the current models would not have been feasible. The long iteration cycles would require waiting years after each failed training attempt, limiting models to one or two orders of magnitude fewer operations. However, with fewer operations, it would not have been possible to achieve the level of intelligence we see today. This is what scaling laws dictate.
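A rough back-of-envelope calculation shows why 10^25 FLOPs lands in a window of roughly two to three months; the cluster size and per-GPU throughput below are illustrative assumptions, not figures from the talk.

```python
# Back-of-envelope estimate: how long does 1e25 FLOPs take on a hypothetical cluster?
# The cluster size and sustained per-GPU throughput are assumptions for illustration only.
TOTAL_FLOPS = 1e25        # target training compute discussed above
NUM_GPUS = 10_000         # assumed cluster size
FLOPS_PER_GPU = 2e14      # assumed sustained FLOP/s per GPU, utilization losses included

cluster_throughput = NUM_GPUS * FLOPS_PER_GPU     # sustained FLOP/s for the whole cluster
days = TOTAL_FLOPS / cluster_throughput / 86_400  # wall-clock days at that throughput
print(f"~{days:.0f} days")                        # ~58 days, i.e. roughly two months
```

Slow the chips down or shrink the cluster by an order of magnitude and the same run stretches into years, which is exactly the scenario described above.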
**Third** is the improvement in algorithms. The Transformer architecture, invented in 2017, started out as a translation model, a fairly specialized application. Later, many people expanded it into a more general architecture and discovered that the Transformer is highly versatile: no matter the type of data or what needs to be learned, as long as it can be expressed digitally, the Transformer can learn it. Moreover, this versatility scales exceptionally well.
With more traditional structures, such as recurrent neural networks (RNNs) or convolutional neural networks (CNNs), adding more parameters—beyond a billion or so—would not lead to improvements. However, with the Transformer, adding more parameters continually enhances performance, with no apparent upper limit. This structure has made general learning possible. All you need to do is keep feeding data into the model and define the objective function for what you want it to learn.
These three elements—data, computation, and algorithms—combine to give rise to the general models we see today, and all of them are indispensable.
It’s fascinating to observe that the development of human technology always builds upon the shoulders of those who came before.
There’s a highly recommended book called *The Nature of Technology*, which discusses how technological progress is fundamentally a process of combinatorial evolution. Each generation of technology can be seen as a combination of several previous generations. However, certain combinations yield far greater power than others. For example, the three factors we just discussed—the combination of data, computation, and algorithms—are incredibly powerful, leading to the creation of general models. Yet, before OpenAI, perhaps no one imagined that combining these three elements could produce such tremendous results.
The Three Levels of Challenges for AGI
Building on the three factors mentioned earlier, I believe that for artificial general intelligence (AGI), there are three key levels of challenges:
1. Scaling Law: At the most fundamental level is the scaling law, the first layer of innovation opportunity, discovered and perfected by OpenAI. The scaling law shows that as you increase model size, data, and compute, the model's performance improves in a predictable way (a formula sketch follows after this list).
2. Unified Representation: The second level of innovation opportunities revolves around some unresolved issues within the scaling law framework, such as how to integrate all modalities (e.g., text, images, sound) into a unified representation within the same model. This is a key challenge at the second level.
Additionally, despite the internet's 20-plus years of development, the amount of available data is finite. We are now encountering a "data wall," where there is no longer enough data to continue training AI models effectively.
For example, if we want to create an AI with excellent mathematical abilities, we need to consider what data would help it learn these skills. However, there are relatively few digitized math problems available, and most data on the internet is unrelated to mathematics.
The best available data has already been used extensively, and it's difficult for any individual or company to find a dataset ten times larger than what the internet offers. This data wall is a significant challenge. If we can overcome it, the second level of opportunities and gains will become accessible.
3. Reasoning and Instruction-Following: The third level of challenges involves tasks requiring longer contexts, stronger reasoning abilities, or more advanced instruction-following. These are the challenges at the third level.
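For the quantitative shape of the claim in point 1, one widely cited empirical form of the scaling law is the parametric fit from the Chinchilla paper (Hoffmann et al., 2022); this is standard published background rather than a formula from the talk:

```latex
% Chinchilla-style parametric scaling law: loss as a function of
% parameter count N and number of training tokens D.
L(N, D) \;\approx\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
% E is the irreducible loss; A, B, \alpha, \beta are empirically fitted constants.
% Loss falls smoothly as either model size or data grows, which is why
% scaling N and D together worked as the first-level strategy.
```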
At the foundational level, we have **first principles**. Once first principles are established, they settle the question of whether something is possible at all: the leap from 0 to 1. Above first principles, however, there are many second-level issues, such as core technologies, that still need to be resolved. Many researchers are working on these second-level technical challenges, and if they succeed, these innovations can take the technology from merely feasible to highly practical and scalable.
When looking at the development of the steam engine, we see the same pattern: once the fundamental theory was established, the first principles were clear. But during the process of implementing the steam engine, the initial power output was insufficient, or the cost was too high. Almost all new technologies face these two challenges when they first emerge.
A key problem we discussed earlier is the **data wall**. Based on first principles, we need to continuously train larger models and feed them more data, but this creates a conflict. Natural data has been exhausted, so how do we add more data to sustain the scaling process? This requires a **paradigm shift**.
Previously, tasks were relatively simple—merely predicting the next token. However, this process inherently involves a lot of reasoning and knowledge.
For instance, if the input sentence is "The closest direct-controlled municipality to Beijing is Tianjin," the language model uses the previous words as input to predict the last word (Tianjin or Chongqing). After making multiple predictions, the model learns that the correct answer is Tianjin. Through this prediction process, the model acquires knowledge.
Another example would be reading a detective novel. After reading nine chapters, the model predicts who the killer is in the final chapter. This is similar to predicting the next token. If the sentence is complex, and after much reasoning the model concludes that a particular character is the killer, it has effectively learned reasoning.
If there is a large amount of such data, the model will learn reasoning skills. It will acquire not only reasoning but also knowledge and many other tasks. If we continue feeding the model all available data and have it predict the next token, its intelligence will increase, its reasoning will improve, and its knowledge will expand.
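To make the objective concrete, here is a deliberately minimal next-token-prediction sketch in PyTorch; the toy vocabulary and the embedding-plus-linear stand-in for a full Transformer are simplifications for illustration:

```python
import torch
import torch.nn as nn

# Toy vocabulary for the Beijing/Tianjin example above.
vocab = {"the": 0, "closest": 1, "municipality": 2, "to": 3, "beijing": 4, "is": 5, "tianjin": 6, "chongqing": 7}
tokens = torch.tensor([[0, 1, 2, 3, 4, 5, 6]])  # "the closest municipality to beijing is tianjin"

embed = nn.Embedding(len(vocab), 32)   # a real language model would put a Transformer between
lm_head = nn.Linear(32, len(vocab))    # these two layers; it is omitted to keep the sketch short

# Inputs are all tokens except the last; targets are the same tokens shifted left by one,
# so at every position the model is asked to predict the *next* token.
inputs, targets = tokens[:, :-1], tokens[:, 1:]
logits = lm_head(embed(inputs))        # shape: (batch, sequence, vocab)
loss = nn.functional.cross_entropy(logits.reshape(-1, len(vocab)), targets.reshape(-1))
loss.backward()                        # repeat over a huge corpus and the model absorbs the fact
                                       # that "... municipality to Beijing is" continues with "Tianjin"
```

The only supervision signal is a shifted copy of the input itself, which is why any digitized text, from encyclopedias to detective novels, can feed the same objective.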
There are three types of things that the model can learn:
1. Low-Entropy Knowledge: Factual knowledge or information carries very low entropy (uncertainty), so the model can simply memorize it.
2. Reasoning Process: For tasks like reasoning in detective novels, there may be multiple reasoning paths with medium entropy, but the end result remains the same.
3. High-Entropy Creative Tasks: For creative tasks, such as writing a novel, there is high uncertainty (entropy), and the outcome is not deterministic.
All of these different types of tasks can be learned under a single framework by predicting the next token. By focusing on this single objective, the model can learn these diverse tasks. Moreover, it doesn’t need to choose whether it’s learning from Xiaohongshu or Wikipedia—this is the foundation of **general intelligence**. By putting everything into one model, it can learn without needing to differentiate between different sources. This is what makes general intelligence truly versatile.
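To see the low/medium/high-entropy distinction in numbers, here is a small illustrative calculation; the probability values are invented purely to mirror the three cases above:

```python
# Illustrative only: Shannon entropy of three hypothetical next-token distributions.
import math

def entropy(probs):
    """Shannon entropy H = -sum(p * log2 p), in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

factual = [0.98, 0.02]        # "... municipality to Beijing is ___": one answer dominates
reasoning = [0.6, 0.3, 0.1]   # detective novel: a few plausible suspects, one most likely
creative = [0.1] * 10         # novel writing: many continuations are equally acceptable

print(entropy(factual))    # ~0.14 bits -> low entropy, memorize it
print(entropy(reasoning))  # ~1.30 bits -> medium entropy, reason it out
print(entropy(creative))   # ~3.32 bits -> high entropy, open-ended generation
```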
The Release of OpenAI’s o1 Marks the Emergence of a New Paradigm
The next paradigm is driven by **reinforcement learning**. Why do we need reinforcement learning? As mentioned earlier, natural data is running out, and OpenAI's release of o1 marks the shift from the previous paradigm to a new one. In the earlier paradigm, models relied on natural data to predict the next token, but as we've discussed, there are only so many math problems in the world. So how do we improve AI's mathematical abilities?
The solution is to continuously generate new problems, have the AI solve them, and then learn from the correct and incorrect answers. This iterative process of improving based on feedback is the essence of **reinforcement learning**.
This new paradigm differs from the previous one. Earlier, models would rely on natural data to predict the next word in a sequence. Now, after having built a strong foundational model, the AI can essentially "play with itself," generating a vast amount of data, learning from the good outcomes, and discarding the bad ones. In doing so, it can create data that didn’t exist before.
For example, with o1, you will notice that it generates many thoughts during problem-solving. What’s the purpose of these generated thoughts? The core idea is that this process **creates data**. In the real world, some types of data don’t naturally exist. For instance, if a brilliant mathematician proves a new theorem or solves a complex math problem, or even competes in a contest, they only write down the solution, not the thought process behind it. So, this type of data doesn’t naturally exist.
However, we now want AI to generate these thought processes itself and then learn from them to improve generalization. For example, if you give a student a challenging problem, simply showing the solution might not be enough. The student needs someone to explain the thought process behind the solution—how the solution was reached, why a particular approach was used. By learning the thought process, the student will be able to solve similar problems in the future, even if they are slightly different.
If the student only learns the final answer, they will only be able to solve that exact type of problem. For example, they could memorize how to solve quadratic equations and apply the same method every time, but they wouldn’t understand the reasoning behind it. If the student learns the thought process, it’s like having a great teacher who constantly guides them on how to think. This will significantly enhance their generalization ability and generate new data that didn’t naturally exist before. This **data generation process** allows scaling to continue.
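OpenAI has not published o1's training recipe, but a toy sketch of the loop described above, in the spirit of publicly documented rejection-sampling or STaR-style methods, might look like this. The "model" here is just a noisy guesser; the point is the shape of the loop: synthesize problems, sample reasoning, verify, and keep the successful traces as new training data.

```python
# Toy, self-contained sketch of the "generate -> solve -> verify -> keep -> learn" loop.
import random

def generate_problem():
    a, b = random.randint(1, 9), random.randint(1, 9)
    return f"{a} + {b} = ?", a + b             # synthesized problem with a known reference answer

def sample_reasoning_and_answer(problem):
    # Stand-in for the model "thinking out loud": a short trace plus a (noisy) answer.
    a, b = int(problem[0]), int(problem[4])
    trace = f"I need to add {a} and {b}."
    answer = a + b if random.random() < 0.7 else a + b + random.choice([-1, 1])
    return trace, answer

kept = []
for _ in range(1000):
    problem, reference = generate_problem()
    for _ in range(4):                          # several attempts per problem
        trace, answer = sample_reasoning_and_answer(problem)
        if answer == reference:                 # verifier: keep only attempts that reach the right result
            kept.append((problem, trace, answer))  # keep the whole reasoning trace, not just the answer

print(f"{len(kept)} verified (problem, reasoning, answer) examples ready to train on")
```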
Moreover, the way scaling happens is also changing. Previously, most of the scaling occurred during the training phase, where large datasets were used to train the model. But now, more and more computation is shifting to the **inference phase**. Since the model now needs to think, this reasoning process itself requires computational power, and it can also be scaled. This makes sense, as complex tasks require more time. For example, you wouldn’t expect someone to prove the Riemann hypothesis in a second or two—it might take them years to solve.
The next key challenge is how to define increasingly complex tasks. For these more complex tasks, the way models interact with humans may also change. Instead of completely synchronous interaction, it could shift to an asynchronous mode, where the model is allowed to spend time gathering information, thinking, analyzing, and then delivering a report. This would enable the model to complete more sophisticated tasks by combining the scaling law with **reinforcement learning** during the inference phase.
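One simple, publicly known way to spend more compute at inference time is to sample many candidate solutions and keep one that a cheap verifier accepts; this illustrates the general idea and is not a claim about how o1 works. The toy below uses factoring, where checking a candidate is far cheaper than finding one:

```python
# Toy illustration of shifting compute to inference: try many candidates, keep a verified one.
import random

def propose_factorisation(n):
    """Stand-in for one 'thinking' attempt: guess two factors of n."""
    a = random.randint(2, n - 1)
    return a, n // a

def verify(n, a, b):
    return a * b == n and a > 1 and b > 1       # checking is cheap compared with searching

def solve_with_budget(n, budget):
    """Spend up to `budget` attempts; a larger budget means more inference-time compute."""
    for _ in range(budget):
        a, b = propose_factorisation(n)
        if verify(n, a, b):
            return a, b
    return None

n = 221  # 13 * 17
for budget in (1, 10, 100, 1000):
    successes = sum(solve_with_budget(n, budget) is not None for _ in range(100))
    print(f"budget={budget:5d}  solved {successes}/100")
```

The success rate climbs as the attempt budget grows, which is the sense in which inference-time computation itself can be scaled.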
The Upper Limit of This Generation of AI Technology Is Fundamentally Tied to the Capabilities of Text Models
I believe that the core determinant of the upper limit of this generation of AI technology is the capabilities of text models. If text models can continue to improve in intelligence, they will be able to handle increasingly complex tasks. It’s akin to the process of learning—initially capable of solving elementary school problems, then progressing to high school, university, and now even possessing knowledge and reasoning skills at the level of a PhD.
As text models continue to improve, the ceiling for this generation of AI technology will rise. In my view, the upper limit of the value of AI technology is closely tied to the ability of text models, making the continuous improvement of these models crucial. As long as the scaling law remains valid, we can expect ongoing advancements.
On one axis, we are seeing the incorporation of more modalities, as discussions around "multimodal models" have gained traction. For instance, there is input and output in visual formats, as well as audio input and output. These modalities may even be converted between each other. Consider an example where a product's requirements are illustrated through a drawing, and the system can directly convert that into code, which in turn can generate a video to be used as a landing page. This is a cross-modal task that current AI is not yet fully capable of. However, in one or two years, we could see significant integration of these modalities.
Ultimately, how well these modalities integrate will depend on the "brain"—in other words, the strength of the text model. This is because the model needs to handle complex planning. If the results of the second step are different from what was anticipated in the first step, the model must adjust dynamically. It may decide not to proceed with the third step as originally planned and opt for a different approach. This requires a high degree of reasoning and planning ability, which are fundamentally constrained by the upper limit of the text model.
We can think of these advancements along two axes:
1. Horizontal Expansion (Multimodality): This refers to the model's ability to handle a broader range of tasks across different modalities (e.g., vision, audio, text). The more modalities the AI can manage, the more tasks it can take on.
2. Vertical Development (Text Model Intelligence): This refers to how smart the AI becomes. The smarter the text model is, the more complex tasks it can handle. For example, even if an AI is highly intelligent but lacks vision (one of the modalities), its ability to perform tasks will be limited. Both dimensions—horizontal and vertical—are crucial and will likely see simultaneous improvement in the next two to three years. If both improve together, we will be on the path toward **AGI (Artificial General Intelligence)**.
As with any new technology, AI faces two main challenges: suboptimal performance and high costs. However, the good news is that efficiency improvements have been remarkable. For example, training a model at the level of GPT-4 today costs only a fraction of what it did two years ago—potentially one-tenth of the cost while achieving the same level of intelligence.
At the same time, the cost of inference is continuously decreasing. Compared to last year, the cost per unit of intelligence during inference has dropped by an order of magnitude, and it is likely to drop another order of magnitude next year. This makes AI business models more viable, as the cost of acquiring intelligence is dropping while the quality of intelligence is increasing. For users, this leads to a higher ROI (Return on Investment), which is why AI adoption is expected to grow rapidly—a key trend in the AI space.
These two important trends, improving intelligence during training and lowering the cost of using intelligence, combine to enable larger-scale AI deployments. The models will continue to evolve. If you look at OpenAI's o1, one of its key advancements is the ability to handle tasks that would typically take a human a long time to complete. The model is not merely answering a simple question; it goes through a thinking process that may take 20 seconds.
Of course, this 20 seconds is relatively fast for a computer, but if a human were to process the same content, it might take them an hour or two. AI compresses this lengthy process, allowing it to handle increasingly long and complex tasks. This is an important trend for the future.
The Three Core Capabilities of Next-Generation Models
In the future, AI will likely be able to handle tasks that last for minutes or even hours, while seamlessly switching between different modalities, and its reasoning abilities will become increasingly robust. I see these as important trends in the upcoming development of AI.
We aim to integrate products and technology even more closely. The logic behind product design has evolved significantly from the earlier internet-based products. Nowadays, the capabilities of the model largely determine the product experience. If the model’s capabilities are lacking, the user experience cannot be fully realized.
A new concept is emerging: the model is the product.
When we were working on Kimi, we also focused on tightly aligning product features with model capabilities. For instance, a product feature can only work if the underlying model is able to support it. One predictable demand in the future is the role of an AI assistant. In the age of AI, it's highly likely that the "super app" will be an assistant. The demand for intelligence is universal, although current capabilities are still in their early stages. Over time, as the technology improves and costs decrease, the market will increasingly adapt to and embrace this new technology.
I believe there’s a high probability that in the next five to ten years, we will see large-scale market applications. The reason is simple: the demand AI addresses is fundamentally universal intelligence. Today’s software and apps are created by hundreds or thousands of engineers, so their intelligence level is fixed—it doesn't change over time.
However, AI products are different. Since they are powered by models, you could think of them as having the potential of millions of highly capable individuals working behind the scenes, completing various tasks. The ceiling for such models is incredibly high.
A crucial point is that to handle increasingly complex tasks, models must be able to support longer contexts. That's why we've been focusing on improving our models' ability to reason through extended contexts. Looking ahead, we will also focus heavily on productivity-related use cases.
I believe the greatest variable for this generation of AI lies in the productivity sector. Each unit of productivity in today’s society could potentially see a tenfold improvement, which is why we’re focusing on optimizing performance in these productivity scenarios. As performance improves, it directly correlates with the enhancement of model capabilities.
At the same time, I see another critical variable in AI today: data as a variable. When optimizing a system, data shouldn’t be treated as a constant or static resource, which differs from the traditional approach to AI research. For example, five to seven years ago, many people approached AI by working with fixed datasets—training models using specific datasets and then experimenting with different neural network structures or optimizers to improve performance. But all of this was done with static data.
Now, I believe data itself is becoming a dynamic variable—how we use data or gather feedback from users will become increasingly important. One key technique here is Reinforcement Learning from Human Feedback (RLHF), where AI learns from human feedback. Even if AI is highly intelligent, if it isn’t aligned with human values or doesn’t produce results that people want, it won’t offer a great user experience.
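As background on the mechanics mentioned here (the talk does not go into this detail), the published RLHF recipe, e.g. InstructGPT, first trains a reward model on pairs of responses that human labellers preferred or rejected. A minimal sketch of that pairwise loss, with toy values standing in for real model scores:

```python
# Sketch of the reward-model objective used in published RLHF recipes (Bradley-Terry pairwise loss).
import torch
import torch.nn.functional as F

# Reward-model scores for responses a human labeller preferred vs. rejected (toy values).
reward_chosen = torch.tensor([1.2, 0.3, 2.1])
reward_rejected = torch.tensor([0.4, 0.9, 1.5])

# loss = -log sigmoid(r_chosen - r_rejected): pushes the reward model to score
# human-preferred responses higher, so the policy later optimized against it
# stays aligned with what people actually want.
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
print(loss)   # lower loss means the reward model agrees with human preferences more often
```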
I believe the journey toward AGI will be a process of co-creation, not just a pure technological pursuit. Technology and products must merge seamlessly. Think of the product as an environment, where the model interacts with users and continuously learns from these interactions. This way, the model improves over time.
Since the Transformer architecture was introduced in 2017, we've conducted a great deal of research and exploration on top of it. Initially, we didn't expect the results to reach the level we see today. However, moving forward, as long as the scaling law remains valid, the intelligence of these models will continue to rise.
For me, this has been an enormous process of exploration, driven by deep curiosity. There is uncertainty everywhere in this journey. Often, we tend to be more optimistic than reality warrants, simply because there are unknowns we're unaware of. For example, when we first began this project, we anticipated many challenges, but no matter how much we predicted, the actual obstacles were always greater than we had imagined.
Even though the first principles of the technology are clear, there are too many unknowns. As Daniel Kahneman, author of Thinking, Fast and Slow, has said, we often attempt things we don’t fully understand, driven by the fact that we don’t know what we don’t know—and this ignorance grants us courage. When you start experimenting, you discover new challenges along the way, and that is perhaps the essence of innovation.
In most cases, your attempts may fail, but occasionally, you stumble upon a solution that works. This happens often in our office—you might see someone suddenly cheer, and you might think something’s wrong, but in fact, they’ve just discovered that a particular method worked. It’s that simple.
I believe that often the process of exploring what works and what doesn’t is the simplest path to discovering truth. This exploration isn’t limited to technology; it applies to products and business models as well. Finding what works, what doesn’t, and simply exploring answers is highly valuable in itself.