Is the account breaking news about OpenAI's "Strawberry" actually an AI agent? Stanford startup "hypes" Agent Q

When hype generates "massive traffic", nobody cares whether the product is actually any good.

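Recently, OpenAI's secret project "Q*" has attracted wide attention among insiders. Last month, a project codenamed "Strawberry", said to be based on it, also came to light. The project is presumed to offer advanced reasoning capabilities.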

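Over the past few days, several waves of teasers about this project that never delivered have circulated online. In particular, the "Strawberry Brother" account, which promotes the project nonstop, has repeatedly raised people's hopes and then let them down.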


Unexpectedly, could this "marketing account", which posts wherever Sam Altman shows up, actually be an AI agent under the hood?

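Today, the founder of the AI agent startup MultiOn claimed credit directly: we did not wait for OpenAI to release "Q*"; we have released Agent Q, a new agent that controls the "Strawberry Brother" account, so come and play with it online!

Div Garg, co-founder and CEO of MultiOn, took a leave of absence from his PhD program in computer science at Stanford University.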

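This marketing stunt, which effectively had OpenAI doing the promotional work for someone else, seems to have left everyone confused. After all, many people have recently been staying up all night waiting for OpenAI's "big news". This traces back to an exchange between Sam Altman and "Strawberry Brother": under a photo of strawberries posted by Sam Altman, he replied, "A surprise is coming soon."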

However, Div Garg, the founder of MultiOn, has quietly deleted the post claiming that Agent Q is "Strawberry Brother".

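This time, MultiOn announced that Agent Q, the agent it has released, is a breakthrough AI agent. Its training method combines Monte Carlo Tree Search (MCTS) with self-critique, and it learns from human feedback through an algorithm called Direct Preference Optimization (DPO).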

As a next-generation AI agent with planning and AI self-healing capabilities, Agent Q performs 3.4 times better than the baseline zero-shot performance of LLaMa 3. In evaluations on real-world tasks, Agent Q's success rate reached 95.4%.

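What can Agent Q do? Let's first look at the official demo.

It can book a table at a specific restaurant at a specific time.

It then performs web operations such as checking availability, and finally the reservation succeeds.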

In addition, it can book flights (for example, a one-way flight from New York to San Francisco this Saturday, window seat, economy class).


However, netizens do not seem to be buying the Agent Q pitch. What everyone cares about more is whether they really used the "Strawberry Brother" account for hype. Some have even called them shameless liars.

Overview of important components and methods

The paper on Agent Q has now been released, co-authored by researchers from MultiOn and Stanford University. The results of this research will be made available to developers and general users of MultiOn later this year.

Paper address: https://multion-research.s3.us-east-2.amazonaws.com/AgentQ.pdf

To summarize: Agent Q can autonomously plan and self-correct on the web, learning from successes and failures to improve its performance on complex tasks. Ultimately, the agent can better plan how to navigate the Internet and adapt to real-world complexity.

In terms of technical details, the main components of Agent Q are the following:

Guided search with MCTS (Monte Carlo Tree Search): This technique autonomously generates data by exploring different actions and web pages, balancing exploration and exploitation. MCTS uses high sampling temperatures and diverse prompts to expand the action space, ensuring a diverse and optimal set of trajectories.

AI self-critique: At every step, AI-based self-critique provides valuable feedback to improve the agent's decision-making process. This step-level feedback is crucial for long-horizon tasks, where sparse signals often make learning difficult.

Direct Preference Optimization (DPO): This algorithm builds preference pairs from data generated from MCTS to fine-tune the model. This off-policy training approach allows the model to efficiently learn from aggregated data sets, including suboptimal branches explored during search, thereby improving success in complex environments.
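As a rough illustration of this step, the sketch below builds preference pairs from per-action value estimates at one search node and scores a pair with the standard DPO loss. The value table, the `min_gap` threshold, the `beta` setting and all function names are assumptions made for this example; the article does not specify how Agent Q filters or weights its pairs.

```python
import math
from itertools import combinations

def build_preference_pairs(action_values, min_gap=0.2):
    """Return (chosen, rejected) pairs for actions whose values differ enough.

    action_values maps an action to a value estimate from search;
    min_gap is a hypothetical threshold that drops near-ties.
    """
    pairs = []
    for (a, va), (b, vb) in combinations(action_values.items(), 2):
        if abs(va - vb) >= min_gap:
            pairs.append((a, b) if va > vb else (b, a))
    return pairs

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair: -log sigmoid(beta * margin difference)."""
    # How much more the policy favors each response than the frozen reference.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    x = beta * (chosen_margin - rejected_margin)
    # Numerically stable softplus(-x), which equals -log(sigmoid(x)).
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))

# Hypothetical value estimates for three actions at one node.
pairs = build_preference_pairs({"click_date": 0.9, "scroll": 0.3, "go_back": 0.25})
neutral = dpo_loss(0.0, 0.0, 0.0, 0.0)                 # policy equals reference
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0, beta=1.0)  # policy prefers chosen
```

The loss falls below its neutral value of log 2 once the policy assigns relatively more probability to the chosen action than the reference model does, which is the direction fine-tuning pushes.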

The following focuses on how the MCTS algorithm is applied to web pages. The researchers explored how to give agents additional search capabilities through MCTS.

In previous work, the MCTS algorithm usually consists of four stages: selection, expansion, simulation, and backpropagation. Each stage plays a key role in balancing exploration and exploitation and in iteratively refining the policy.
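The four stages can be sketched on a toy problem. The example below is a generic, minimal MCTS with UCB1 selection and random rollouts, not Agent Q's implementation: the binary action set, the depth of 3, the `TARGET` sequence and the exploration constant are all assumptions chosen so the code runs on its own.

```python
import math
import random

ACTIONS = (0, 1)
DEPTH = 3
TARGET = (1, 0, 1)  # hypothetical "successful" action sequence

class Node:
    def __init__(self, state, parent=None):
        self.state = state      # tuple of actions taken so far
        self.parent = parent
        self.children = {}      # action -> Node
        self.visits = 0
        self.value = 0.0        # running mean of observed returns

def is_terminal(state):
    return len(state) == DEPTH

def ucb1(parent, child, c=1.4):
    # Exploitation term plus an exploration bonus for rarely visited children.
    return child.value + c * math.sqrt(math.log(parent.visits) / child.visits)

def select(node):
    # 1. Selection: descend through fully expanded nodes by UCB1.
    while not is_terminal(node.state) and len(node.children) == len(ACTIONS):
        parent = node
        node = max(parent.children.values(), key=lambda ch: ucb1(parent, ch))
    return node

def expand(node, rng):
    # 2. Expansion: attach one untried child (terminal nodes are returned as-is).
    if is_terminal(node.state):
        return node
    action = rng.choice([a for a in ACTIONS if a not in node.children])
    node.children[action] = Node(node.state + (action,), parent=node)
    return node.children[action]

def simulate(state, rng):
    # 3. Simulation: random rollout until a terminal state, then score it.
    while not is_terminal(state):
        state += (rng.choice(ACTIONS),)
    return 1.0 if state == TARGET else 0.0

def backpropagate(node, ret):
    # 4. Backpropagation: update visit counts and running means up to the root.
    while node is not None:
        node.visits += 1
        node.value += (ret - node.value) / node.visits
        node = node.parent

def mcts(iterations=500, seed=0):
    rng = random.Random(seed)
    root = Node(())
    for _ in range(iterations):
        leaf = expand(select(root), rng)
        backpropagate(leaf, simulate(leaf.state, rng))
    # Read off the most-visited path as the search result.
    node, path = root, []
    while node.children:
        action, node = max(node.children.items(), key=lambda kv: kv[1].visits)
        path.append(action)
    return tuple(path)

best_path = mcts()
```

Because only one leaf pays a reward, the visit counts concentrate on the branch leading to it, which is the exploration and exploitation trade-off the four stages are meant to manage.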

The researchers formulated web agent execution as a tree search over web pages, where the state consists of the agent's history and the DOM tree of the current web page. Unlike board games such as chess or Go, the web agents used by the researchers operate in an open-ended and changeable space.

The researchers use the base model as an action-proposal distribution and sample a fixed number of possible actions on each node (webpage). Once an action is selected and performed in the browser, the next web page is traversed and becomes a new node along with the updated history.
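A minimal sketch of this node structure follows, assuming a state made of the action history plus a serialized DOM. The `propose_actions` and `browser_step` functions are hypothetical stand-ins: a real agent would sample actions from the language model and drive an actual browser.

```python
import random
from dataclasses import dataclass, field

@dataclass
class WebNode:
    history: tuple                                 # actions taken so far
    dom: str                                       # serialized DOM of the page
    children: dict = field(default_factory=dict)   # action -> WebNode

def propose_actions(node, k, rng):
    # Stand-in for sampling k actions from the base model; a real agent
    # would condition an LLM on node.history and node.dom instead.
    candidates = ["click:date", "click:time", "type:name", "scroll", "go_back"]
    return rng.sample(candidates, k)

def browser_step(dom, action):
    # Hypothetical environment call: perform the action, return the new DOM.
    return f"<html><!-- page after {action} --></html>"

def step(node, action):
    # The resulting page plus the updated history becomes a new node.
    child = WebNode(node.history + (action,), browser_step(node.dom, action))
    node.children[action] = child
    return child

rng = random.Random(0)
root = WebNode(history=(), dom="<html><!-- start page --></html>")
candidates = propose_actions(root, k=3, rng=rng)   # fixed number per node
child = step(root, candidates[0])
```

Keeping the history inside the node means two visits to the same URL with different histories are treated as different states, which matches the state definition given above.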

The researchers query the feedback model over multiple iterations, each time removing from the list the best action selected in the previous iteration, until all actions are fully ranked. Figure 4 below shows the complete AI feedback process.
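The iterative ranking loop can be sketched as follows. The feedback model is replaced by a stub that looks up a hypothetical score table; in the setting described here, each call to `pick_best` would be one query to the AI feedback model.

```python
def rank_actions(actions, pick_best):
    """Rank actions by repeatedly asking for the best of those remaining."""
    remaining = list(actions)
    ranking = []
    while remaining:
        best = pick_best(remaining)   # one feedback-model query per iteration
        ranking.append(best)
        remaining.remove(best)        # drop the winner before the next query
    return ranking

# Hypothetical scores standing in for the feedback model's judgment.
HYPOTHETICAL_SCORES = {"click:date": 0.9, "type:name": 0.6, "scroll": 0.2}

def stub_feedback_model(candidates):
    return max(candidates, key=HYPOTHETICAL_SCORES.get)

ranking = rank_actions(["scroll", "click:date", "type:name"], stub_feedback_model)
```

Ranking n actions this way takes n queries, each over a shorter list than the last.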

Expansion and backtracking. The researchers select an action and perform it in the browser environment to reach a new node (page). Starting from the selected state node, they roll out the trajectory using the current policy π_θ until a terminal state is reached. The environment returns a reward R at the end of the trajectory, where R = 1 if the agent succeeds and R = 0 otherwise. This reward is then backpropagated by updating the value of each node bottom-up, from the leaf node to the root node.
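A common way to write such a backup is the incremental mean used in standard MCTS. The notation is assumed for illustration: h_t is the node's state (history), a_t the action taken there, N the visit count, Q the mean value, and R the terminal reward (1 on success, 0 otherwise).

```latex
Q(h_t, a_t) \leftarrow \frac{Q(h_t, a_t)\, N(h_t, a_t) + R}{N(h_t, a_t) + 1},
\qquad
N(h_t, a_t) \leftarrow N(h_t, a_t) + 1
```

Applied at every node on the path from leaf to root, this keeps Q equal to the average reward of all trajectories that passed through that node and action.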

Figure 3 below shows all the results and the baselines. When the agent is allowed to search at test time, that is, when MCTS is applied to the base xLAM-v0.1-r model, the success rate increases from 28.6% to 48.4%, approaching the average human performance of 50.0% and significantly exceeding the zero-shot performance of DPO models trained only through outcome supervision.

The researchers further fine-tuned the base model using the algorithm outlined in the figure below, which yielded a 0.9% improvement over the base DPO model. Applying MCTS on top of the fine-tuned Agent Q model raised the agent's performance to 50.5%, slightly exceeding average human performance.

They believe that even if an agent has undergone extensive reinforcement learning training, having search capabilities at test time is still an important paradigm shift. This is a significant improvement over untrained zero-shot agents.

Furthermore, although dense, step-level supervision improves on purely outcome-based supervision, the gain from this training method is small in the WebShop environment. This is because in this environment the agent only needs to follow short decision paths and can learn credit assignment from outcomes alone.

Evaluation results

To test how the Agent Q framework performs in the real world, the researchers chose the task of having the agent book a restaurant on the OpenTable website. To complete this booking task, the agent must find the restaurant's page on the OpenTable website, select a specific date and time, select seating that matches the user's preferences, and finally submit the user's contact information before the reservation can succeed.

Initially, they ran experiments on the xLAM-v0.1-r model, but it performed poorly, with an initial success rate of 0.0%. They therefore switched to the LLaMa-3 70B Instruct model, which achieved some initial success.

However, since OpenTable is a live environment, it is difficult to measure and evaluate programmatically or through automation. The researchers therefore used GPT-4-V to collect a reward for each trajectory based on the following criteria: (1) the date and time are set correctly, (2) the party size is set correctly, (3) the user information is entered correctly, and (4) the agent clicks "Complete your reservation". If all of these conditions are met, the agent is deemed to have completed the task. The resulting supervision setup is shown in Figure 5 below.
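The all-or-nothing reward described here can be sketched as below. The four boolean fields mirror the listed criteria; how they would actually be obtained from GPT-4-V is not covered, and the class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class TrajectoryChecks:
    date_time_correct: bool    # (1) date and time set correctly
    party_size_correct: bool   # (2) party size set correctly
    user_info_correct: bool    # (3) user information entered correctly
    clicked_complete: bool     # (4) "Complete your reservation" clicked

def trajectory_reward(checks):
    # Reward is 1 only when every criterion holds, otherwise 0.
    passed = (checks.date_time_correct, checks.party_size_correct,
              checks.user_info_correct, checks.clicked_complete)
    return 1 if all(passed) else 0

success = trajectory_reward(TrajectoryChecks(True, True, True, True))
failure = trajectory_reward(TrajectoryChecks(True, True, False, True))
```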

Agent Q raised the zero-shot success rate of the LLaMa-3 model from 18.6% to 81.7% after only a single day of autonomous data collection, a relative increase of roughly 340%. After online search capabilities were added, the success rate climbed to 95.4%.
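The 340% figure is a relative increase, which can be checked from the two success rates quoted above. The helper name is chosen for this example only.

```python
def relative_increase_percent(before, after):
    # Percentage gain relative to the starting value.
    return (after - before) / before * 100

gain = relative_increase_percent(18.6, 81.7)   # about 339%, i.e. roughly 340%
ratio = 81.7 / 18.6                            # about 4.4 times the baseline
```

In other words, the final rate is about 4.4 times the baseline, which corresponds to a gain of about 3.4 times the baseline on top of it.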

Please refer to the original paper for more technical details and evaluation results.

Reference link: https://www.multion.ai/blog/introducing-agent-q-research-breakthrough-for-the-next-generation-of-ai-agents-with-planning-and-self-healing-capabilities
