
The final conclusion of the ACL 2024 paper: large language model ≠ world simulator, Yann LeCun: That's so right

PHPz
Released: 2024-06-16 22:31:09

If GPT-4 is only about 60% accurate when simulating state changes in common-sense tasks, should we still consider using large language models as world simulators?


Over the past two days, a paper accepted at ACL 2024, "Can Language Models Serve as Text-Based World Simulators?", has triggered heated discussion on the social media platform X; even Turing Award winner Yann LeCun joined in.

The paper explores this question: can current language models themselves act as world simulators, correctly predicting how actions change world states, thereby avoiding the need for extensive manual coding?

Authors from the University of Arizona, New York University, Johns Hopkins University, Microsoft Research, the Allen Institute for Artificial Intelligence, and other institutions answer this question in the context of text-based simulators.

Their conclusion: language models cannot be used as world simulators. For example, GPT-4 is only about 60% accurate when simulating state changes in common-sense tasks such as boiling water.

Thread: https://x.com/peterjansen_ai/status/1801687501557665841

Yann LeCun expressed his agreement with the paper's finding, arguing that "without a world model, there is no planning."

However, some have voiced a different view: if current LLMs, without task-specific training, already reach 60% accuracy, doesn't that make them world models at least "to a certain extent"? And accuracy will keep improving as LLMs iterate. LeCun countered that the world model will not be an LLM.

Returning to the paper: the researchers built a new benchmark they call "ByteSized32-State-Prediction", consisting of a dataset of text-game state transitions and accompanying game tasks. They use this benchmark to directly quantify, for the first time, how well large language models (LLMs) perform as text-based world simulators.


Testing GPT-4 on this dataset, the researchers found that despite its impressive performance, without further innovation it remains an unreliable world simulator.

The researchers therefore argue that their work offers both new insight into the capabilities and weaknesses of current LLMs and a new benchmark for tracking future progress as new models emerge.

Paper address: https://arxiv.org/pdf/2406.06485

Method overview

The researchers explored the ability of LLMs to serve as world simulators in text-based virtual environments, where an agent receives observations and proposes actions in natural language to accomplish certain goals.

Each textual environment can be formally represented as a goal-conditioned partially observable Markov decision process (POMDP), a 7-tuple (S, A, T, O, R, C, D), where S is the state space, A the action space, T : S×A→S the transition function, O the observation function, R : S×A→R the reward function, C a natural-language "context message" describing the goal and action semantics, and D : S×A→{0,1} a binary completion indicator function.
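As a rough illustration (not the authors' code), this 7-tuple can be sketched as a Python structure; the state here is a hypothetical dictionary of object properties, and the example environment is invented:

```python
from dataclasses import dataclass
from typing import Callable, Dict

State = Dict[str, dict]   # object name -> property dict, e.g. {"sink": {"isOn": False}}
Action = str              # natural-language action, e.g. "turn on sink"

@dataclass
class TextGamePOMDP:
    """Goal-conditioned POMDP (S, A, T, O, R, C, D) for a text game."""
    transition: Callable[[State, Action], State]   # T : S x A -> S
    observe: Callable[[State], str]                # O : state -> natural-language observation
    reward: Callable[[State, Action], float]       # R : S x A -> R
    context: str                                   # C : goal and action semantics, in text
    done: Callable[[State, Action], bool]          # D : S x A -> {0, 1}

# Minimal toy environment: a sink that can be turned on.
env = TextGamePOMDP(
    transition=lambda s, a: {**s, "sink": {"isOn": True}} if a == "turn on sink" else s,
    observe=lambda s: "The sink is on." if s["sink"]["isOn"] else "The sink is off.",
    reward=lambda s, a: 1.0 if s["sink"]["isOn"] else 0.0,
    context="Goal: turn on the sink.",
    done=lambda s, a: s["sink"]["isOn"],
)
```

A simulator, whether hand-coded or an LLM, must implement the transition, reward, and completion components of this tuple.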

Large Model Simulator (LLM-Sim) task

The researchers propose a prediction task, called LLM-as-a-Simulator (LLM-Sim), to quantitatively evaluate how reliably a language model can serve as a simulator.

The LLM-Sim task is to implement a function F : C×S×A→S×R×{0,1} as a world simulator. In practice, a complete state transition simulator F should consider two types of state transitions: action-driven transitions and environment-driven transitions.

Figure 1 shows an example of using an LLM as a text-game simulator: after the sink is turned on, the cup in the sink fills with water. The action-driven transition is that taking the action turns the sink on (isOn=true); the environment-driven transition is that, with the sink on, water fills the cup in it.


To better understand the LLM's ability to model each type of transition, the researchers further decompose the simulator function F into three steps:


  • Action-driven transition simulator: given c, s_t and a_t, F_act : C×S×A→S predicts s^act_{t+1}, the direct state change caused by the action.
  • Environment-driven transition simulator: given c and s^act_{t+1}, F_env : C×S→S predicts s_{t+1}, the state after any environment-driven transitions.
  • Game progress simulator: given c, s_{t+1} and a_t, F_R : C×S×A→R×{0,1} predicts the reward r_{t+1} and the game completion status d_{t+1}.
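The three-step decomposition above amounts to function composition, F = F_R ∘ F_env ∘ F_act. A minimal sketch, where the three simulators are hypothetical stand-ins (in the paper each would be an LLM call or hand-coded logic):

```python
def simulate_step(context, state, action, f_act, f_env, f_r):
    """Full simulator F, decomposed as in the paper: F = F_R o F_env o F_act."""
    s_act = f_act(context, state, action)        # F_act: direct effects of the action
    s_next = f_env(context, s_act)               # F_env: environment-driven effects
    reward, done = f_r(context, s_next, action)  # F_R: reward and completion status
    return s_next, reward, done

# Toy stand-ins mirroring Figure 1: turning on the sink (action-driven)
# causes the cup in it to fill with water (environment-driven).
f_act = lambda c, s, a: {**s, "sink": {"isOn": True}}
f_env = lambda c, s: {**s, "cup": {"filled": s["sink"]["isOn"]}}
f_r = lambda c, s, a: (1.0 if s["cup"]["filled"] else 0.0, s["cup"]["filled"])
```

This composition makes explicit which step an LLM gets wrong when its full-simulation prediction fails.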

In addition, the researchers consider two variants of the LLM-Sim task:

  • Full state prediction: the LLM outputs the complete next state.
  • State difference prediction: the LLM outputs only the difference between the input and output states.

Data and Evaluation

To support this task, the researchers introduce a new dataset of text-game state transitions, "BYTESIZED32-State-Prediction (BYTESIZED32-SP)", containing 76,369 transitions represented as (c, s_t, r_t, d_t, a_t, s^act_{t+1}, s_{t+1}, r_{t+1}, d_{t+1}) tuples. These transitions were collected from 31 different text games.

Additional corpus statistics are summarized in Table 1 below.


Performance on LLM-Sim is measured as the model's prediction accuracy against ground-truth labels on the test set. Depending on the experimental condition, the LLM must simulate object properties (simulating F_act, F_env, or F) and/or game progress (simulating F_R or F), defined as follows:

  • Object properties: all objects in the game, each object's properties (such as temperature or size), and its relations to other objects (such as being inside or on top of another object).
  • Game progress: the agent's status relative to the overall goal, including the accumulated reward, whether the game has terminated, and whether the overall goal has been achieved.

The researchers note that in each case the LLM is given the ground-truth previous state (when the function is F_env, the previous state is s^act_{t+1}) and the overall task context. That is, the LLM always performs single-step prediction.
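Under single-step prediction with ground-truth inputs, accuracy can be computed per transition. A sketch under the assumption that a prediction counts as correct only when it exactly matches the gold next state:

```python
def accuracy(predictions, golds):
    """Fraction of predicted next states that exactly match the ground truth."""
    correct = sum(1 for pred, gold in zip(predictions, golds) if pred == gold)
    return correct / len(golds)
```

Because every step starts from the ground-truth state, errors do not compound across steps; the reported numbers are therefore an upper bound on what a multi-step rollout would achieve.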

Experimental results

Figure 1 above illustrates how the researchers use in-context learning to evaluate the model on the LLM-Sim task. They evaluated GPT-4's accuracy under both the full-state and state-difference prediction regimes. The model receives the previous state (encoded as a JSON object), the previous action, and the context message, and generates the subsequent state (as a complete JSON object or as a difference).
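A hypothetical sketch of how such a prompt might be assembled (the paper's actual prompt wording will differ; the field names here are illustrative):

```python
import json

def build_prompt(context, prev_state, action, predict_diff=False):
    """Assemble an in-context prompt asking the model for the next game state."""
    target = ("only the JSON difference from the previous state"
              if predict_diff else "the complete next state as a JSON object")
    return (
        f"{context}\n\n"
        f"Previous state:\n{json.dumps(prev_state, indent=2)}\n\n"
        f"Action: {action}\n\n"
        f"Output {target}."
    )
```

The model's JSON reply would then be parsed and compared against the gold next state (or gold diff) to score the prediction.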

Table 2 below reports GPT-4's accuracy at simulating full state transitions, as well as at simulating action-driven and environment-driven transitions separately.


The researchers made the following important findings:

Predicting action-driven transitions is easier than predicting environment-driven transitions. In the best case, GPT-4 correctly models 77.1% of dynamic action-driven transitions; by comparison, it correctly simulates at most 49.7% of dynamic environment-driven transitions.

Static transitions are easier to predict than dynamic ones. As expected, in most cases modeling static transitions is much easier than modeling dynamic ones.

Predicting the full game state works better for dynamic transitions, while predicting state differences works better for static ones. The state-difference format significantly improves performance (>10%) when simulating static transitions, but degrades performance when simulating dynamic transitions.

Game rules matter, and LLMs can generate good-enough game rules. When no game rules are provided in the context message, GPT-4's performance on all three simulation tasks degrades in most cases.

GPT-4 is able to predict game progress in most cases. Table 3 below shows the results of GPT-4 predicting game progress. With game rule information in context, GPT-4 can correctly predict game progress in 92.1% of test cases. The presence of these rules is crucial in context: without them, GPT-4's prediction accuracy drops to 61.5%.


Humans outperform GPT-4 on the LLM-Sim task. The researchers conducted a preliminary human study on the LLM-Sim task; the results are shown in Table 4 below.

Humans achieved 80% overall accuracy versus 50% for the sampled LLM, with little variation between annotators. This suggests that while the task is generally intuitive and relatively easy for humans, there is still considerable room for improvement for LLMs.


GPT-4 is more error-prone when arithmetic, common sense, or scientific knowledge is required. Figure 2 below shows, for overall state transitions, action-driven transitions, and environment-driven transitions, the proportion of predictions that were correct, that set a property to an incorrect value, or that failed to change a property's value.

GPT-4 handles most simple boolean properties well. Errors cluster around non-trivial properties that require arithmetic (e.g., temperature, timeAboveMaxTemp), common sense (e.g., current_aperture, current_focus), or scientific knowledge (e.g., on).


For more technical details and experimental results, please refer to the original paper.


Source: jiqizhixin.com