Conversation with Tsinghua's Huang Minlie: Borrowing the Level Scheme of Autonomous Driving to Grade AI Dialogue Systems; the Metaverse's Virtual Companions May Sit at L5

This article is reproduced from Leiphone (雷锋网). To reprint it, please apply for authorization on the Leiphone official website.

"I'm so happy I get to be next to you and look at the world through your eyes."

This is a line from the movie "Her", spoken by the AI voice assistant Samantha to the male protagonist. It is a great comfort to him, a man adrift in a forest of steel and concrete who feels lost and powerless.

Samantha is an almost all-purpose, self-learning operating system. She can help the male protagonist pick out his best letters and send them to his favorite publisher; she can instantly search the whole of human knowledge for the response that suits his needs best. Her most powerful function is emotional companionship: all of his confusion and unhappiness can be eased by her warmth in conversation...

Huang Minlie, a professor of computer science at Tsinghua University and a leading Chinese scholar in NLP, has applied NLP technology to mental health and led the development of the AI emotional conversation robot Emohaa. In our interview, Professor Huang mentioned "Her", released in 2013, and his words showed his appreciation of, and hopes for, this science fiction film. As someone who builds AI dialogue systems himself, he looks forward to an empathetic system like the one in "Her" appearing in reality and marking a leap forward for the industry.

This raises several questions: How difficult would it be for an AI dialogue system to perform complex emotional tasks as Samantha does, soothing emotions and healing hearts? How can this difficulty be quantified? How can we measure whether an AI dialogue system has reached Samantha's level?

These are not idle questions. AI dialogue systems are growing explosively today, and dialogue products such as "Xiaodu", "Xiaoai", Google's dialogue robot "Meena", and Facebook's chatbot "Blender" are emerging one after another. However, the lack of standards for AI dialogue systems has left applications at uneven levels and evaluation schemes inconsistent. As a result, the industry has no shared understanding of how well artificial intelligence can interact, which breeds misunderstanding and has also triggered extensive public discussion of consciousness, ethics, and morality.

Some scientists who develop AI dialogue systems say they often find it difficult to judge the level of the systems they build. They believe the industry urgently needs a standard for grading AI dialogue systems; once one exists, a system's capability level can be measured against evidence.

Therefore, to better evaluate the capability level of AI dialogue systems, Professor Huang Minlie teamed up with research institutions in academia and industry and, borrowing the L0-to-L5 grading concept from autonomous driving, formulated the world's first "AI Dialogue System Grading Definition" (hereinafter the "Grading Definition"), which was officially released on June 28.

Note: Professor Huang Minlie explains the grading definition of AI dialogue systems

The emergence of the "Grading Definition" may promote the application of AI dialogue systems in fields such as virtual personal assistants, smart homes, in-car voice assistants, emotional care, and mental health. It may also accelerate the development and deployment of next-generation AI dialogue systems, with a strong impact on academia and industry, and serve as an important reference for research on spoken-language dialogue systems worldwide.

AI Technology Review talked with Professor Huang Minlie about the "Grading Definition". The conversation follows:

AI Technology Review: What gave you the idea of grading AI dialogue systems?

Huang Minlie: Our evaluation of dialogue systems currently has a problem: technical routes and architectures are flourishing in every direction, and they are hard to compare with one another. For example, I may want to compare a smart speaker with a chatbot, but I cannot compare their conversational capabilities, because dialogue systems vary widely in level, there is no unified evaluation scheme, and capabilities are not clearly defined.

Task-oriented dialogue systems have their own evaluation metrics, as do chat-oriented systems and knowledge-based systems. How should these metrics be unified? This is the main question the "Grading Definition" addresses. We therefore borrowed the L0-to-L5 grading of autonomous driving and likewise used L0-L5 to grade AI dialogue systems.

AI Technology Review: Please explain the specific definition of the AI dialogue system grades.

Huang Minlie: Autonomous driving is graded into six levels from L0 to L5. L0 is fully manual driving; L5 is fully autonomous driving, in which the vehicle takes over everything; L1-L4 achieve autonomous driving under certain specific conditions. The grading of autonomous driving mainly concerns how driving responsibility is split between the person and the vehicle, so the definition is relatively simple. Dialogue systems are far more complex: they have many technical routes and architectures, many tasks, and many evaluation metrics. After discussion, we concluded that a grading definition must ultimately satisfy five basic principles:

First, it covers only dialogue systems driven entirely by machines; hybrid human-machine dialogue systems are not considered. Second, it starts from the system's demonstrated capabilities and the user's perception, regardless of how the system is technically implemented. Third, the capability corresponding to each level must be observable, testable, and measurable. Fourth, it does not distinguish task types such as assistant, chat, and knowledge dialogue; all are expressed as "scenarios". Fifth, we hope that measuring a dialogue system's capability level can suggest research directions and serve as a reference for practical applications.

Based on these five principles, we give the definition of AI dialogue system grading:

L0: The actual dialogue is produced by a human and the system has no automatic dialogue capability at all, or the system cannot provide high-quality dialogue in any single scenario.

L1: The system can hold higher-quality dialogue in a single scenario, but it cannot handle context dependencies between scenarios. Suppose I am going on a business trip: I have booked a flight to Nanjing and now need a hotel. Because the trip is to Nanjing, the hotel must be in Nanjing. That is a context dependency between scenarios, and L1 cannot handle the dependency between booking the flight and booking the hotel.

L2: Building on L1, the system can hold higher-quality dialogue in several scenarios at once, handle cross-scenario context dependencies, and switch between scenarios naturally. In the example above, I book a flight and a hotel, then ask about the weather and the local tourist attractions; that is switching naturally and flexibly between different tasks and scenarios. This ability is critical at L2, but L2 cannot hold higher-quality dialogue in new scenarios.

L3: Building on L2, the system can hold high-quality dialogue across a large number of scenarios and can also do so in new ones. You may ask what counts as "a large number": ten, twenty, thirty? To keep the definition broadly applicable, we have not set a specific number, but the ability to hold higher-quality dialogue in new, unseen scenarios is the critical one.

L4: The system can hold high-quality dialogue in new scenarios and shows a higher degree of anthropomorphism across multi-turn interactions, meaning consistency of persona, personality, emotion, viewpoint, and so on. When we chat with a person, that person cannot be a man one day and a woman the next, or a Tsinghua student one moment and a Peking University student the next. People have fixed personal information of their own, and this kind of persona information is still very difficult for dialogue systems to handle. At present we can make a dialogue system reflect a persona to some extent, but it is still far from a truly human-like level.

L5: Going a step beyond L4, the system is highly anthropomorphic in multi-turn interactions, can learn actively and continually during open-scenario interactions, and has multimodal perception and expression. It is like telling a child that what they are doing is wrong, after which the child learns from it. We hope that a future L5 dialogue system can likewise remember and learn what is right and wrong when we tell it. We also hope that it has multimodal perception and expression during interaction, so that it can truly enter the metaverse and all kinds of virtual-human settings, make real expressions and movements, and understand the other party's expressions, actions, emotions, and so on.

The above is the basic definition of L0 through L5 in the "AI Dialogue System Grading Definition".
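As an illustration only, the ladder from L0 to L5 can be read as a cumulative checklist in which each level adds one requirement to the level below it. The Python sketch below follows that reading. The class, field, and function names are hypothetical and do not come from the "Grading Definition"; treating the levels as strictly cumulative, and requiring both continual learning and multimodality for L5, are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    L0 = 0
    L1 = 1
    L2 = 2
    L3 = 3
    L4 = 4
    L5 = 5


@dataclass(frozen=True)
class Capabilities:
    """Hypothetical yes/no capability flags, one per step of the ladder."""
    single_scene_quality: bool = False   # good dialogue in one scenario
    cross_scene_context: bool = False    # carries context across scenarios
    new_scene_quality: bool = False      # good dialogue in unseen scenarios
    consistent_persona: bool = False     # stable persona over many turns
    continual_learning: bool = False     # learns actively during interaction
    multimodal: bool = False             # multimodal perception and expression


def grade(c: Capabilities) -> Level:
    """Return the highest level whose requirements, and all lower ones, are met."""
    ladder = [
        c.single_scene_quality,                  # needed for L1
        c.cross_scene_context,                   # needed for L2
        c.new_scene_quality,                     # needed for L3
        c.consistent_persona,                    # needed for L4
        c.continual_learning and c.multimodal,   # needed for L5 (assumption)
    ]
    level = 0
    for requirement_met in ladder:
        if not requirement_met:
            break  # assumption: a missing lower step caps the level
        level += 1
    return Level(level)
```

Under these assumptions, a system with single-scenario and cross-scenario ability but no new-scenario ability comes out at L2, which matches the description of L2 above.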

AI Technology Review: How do you define the "higher quality" and "high quality" you just mentioned?

Huang Minlie: We have a set of evaluation criteria with a full score of 10 points. High quality means scoring 8-10 points on the three dimensions of relevance, informativeness, and naturalness; higher quality means 6-8 points; low quality means fewer than 6 points.

What do these three dimensions mean? Relevance means that the reply appropriately matches the preceding context. Informativeness means that the reply provides enough of the necessary information; replies such as "I don't know" and "good" carry no information. Naturalness means how natural the reply is compared with a person's: whether the system's language is grammatical and fluent, whether it contains common-sense errors, and so on.

How is this score measured? A number of testers hold full conversations with the dialogue system and then score it subjectively on the three dimensions, much like the evaluation method of the Amazon Alexa Prize competition.
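The scoring procedure described here can be sketched in a few lines of Python: average each dimension over the testers, then map the result to a quality band. This is a rough sketch under stated assumptions, not the team's actual evaluation code. The function and key names are hypothetical. The interview does not say how the three dimensions are combined or which band a score of exactly 8 belongs to, so the sketch assumes the weakest dimension decides the band and that 8 counts as high quality.

```python
from statistics import mean

DIMENSIONS = ("relevance", "informativeness", "naturalness")


def average_ratings(ratings):
    """Average each dimension over all testers' ratings on a 10-point scale."""
    if not ratings:
        raise ValueError("at least one tester rating is required")
    return {dim: mean(r[dim] for r in ratings) for dim in DIMENSIONS}


def quality_band(averages):
    """Map per-dimension averages to a band.

    Assumption: the weakest dimension decides the band, and a score of
    exactly 8 falls in the "high quality" band.
    """
    weakest = min(averages[dim] for dim in DIMENSIONS)
    if weakest >= 8:
        return "high quality"
    if weakest >= 6:
        return "higher quality"
    return "low quality"
```

A call such as `quality_band(average_ratings(ratings))`, where `ratings` is a list with one dictionary per tester, then gives the band for a system. Averaging the three dimensions together instead of taking the weakest one would be an equally plausible reading of the interview.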

Note: The Amazon Alexa Prize competition aims to provide a standard development environment and testing framework to advance the overall capabilities of conversational robots, with prizes of up to 3.5 million US dollars. Under the competition's scoring system, the best system in the three years 2019, 2020, and 2022 averaged between 3.1 and 3.6 points and could chat with a person for 10-14 minutes while satisfying three conditions: coherence, contextual understanding, and fluent responses.

AI Technology Review: What is the significance of defining the classification of AI dialogue systems?

Huang Minlie: The first psychotherapy robot, Eliza, appeared in 1966, so AI dialogue systems have now been under development for nearly 60 years. Over that time, great progress has been made in both applications and algorithmic models. But we also find various inconsistencies, even disagreements, between industrial practice and public perception. Moreover, in recent years AI dialogue systems have evolved from a first generation based on rules and a second generation built around traditional machine learning to a third generation characterized by big data and large models. These systems show amazing dialogue ability on open topics, a revolutionary change in capability.

This revolutionary change raises many new questions. Will AI dialogue systems have personality? Will they have emotions? Can they become virtual companions? These questions extend in turn to further discussion of social cognition and ethics.

For example, it was reported on June 12 that Blake Lemoine, a Google AI ethics researcher, believed the LaMDA language model has a personality, because in his chats with it LaMDA said that it believes it has consciousness and feelings. It also said, "I am aware of my own existence, I am eager to understand the world better, and sometimes feel happy or sad." Opinions on the Internet differ, and everyone is discussing whether AI has personality and consciousness.

Then there is the metaverse, which hopes to replicate the real world on the Internet so that people can interact in an online world. AI dialogue systems are very useful there. For example, AI shopping guides can offer tailored suggestions based on user preferences. This demands excellent conversational interaction; otherwise human-machine communication will be unnatural and soulless, and the metaverse we want will not materialize.

So, given the vigorous development of AI dialogue systems in the foreseeable future, and the huge opportunities and many uncertainties it may bring, exploring a grading definition at this point in time is highly significant.

AI Technology Review: In "Her", the male protagonist falls in love with Samantha because she can handle complex emotional tasks, and he ends up in an emotional crisis. Could an AI dialogue system that has reached L4-L5 cause the same problem? Does this raise ethical issues?

Huang Minlie: Yes. As dialogue systems develop, they may raise very prominent ethical issues, because they challenge the existing ethical order and existing social cognition. That is why, when formulating the "Grading Definition", our team invited Professor Zhang Hongzhong, Dean of the School of Journalism and Communication at Beijing Normal University. In our follow-up work, Professor Zhang will introduce it to regulators and social science circles as soon as possible. Once the relevant departments and scholars understand it, it can help them see the technical logic directly and formulate policies, regulations, and ethical norms that are properly targeted, which is very important.

AI Technology Review: Where do the AI dialogue products currently on the domestic market fall in the "Grading Definition"?

Huang Minlie: Professor Wang Bin, Director of the Xiaomi Technical Committee and Director of its AI Laboratory, worked with us on the "Grading Definition". He currently leads development of the intelligent question-answering and chat functions of Xiaomi's smart assistant "Xiao Ai Classmate". Taking Xiao Ai as an example, I think it has some cross-scenario ability, and its level should be between L2 and L3. Products in the domestic industry are generally in the L2-L3 range today, with the better ones at L3.

AI Technology Review: And what level do foreign AI dialogue products generally reach?

Huang Minlie: In terms of products, there is currently no significant gap between domestic and foreign systems. It is worth noting that building a Chinese AI dialogue system is harder than building an English one. On the one hand, the open-source culture is stronger for English content, so high-quality English data is easier to obtain; on the other hand, the linguistic characteristics of Chinese make it somewhat harder than English.

AI Technology Review: What are the technical difficulties in moving from where most products are today to L4-L5?

Huang Minlie: First, the system must be able to remember. Second, it must be able to associate and reason, and to learn by itself. Third, the key to L4-L5 is multimodality. If an AI dialogue system is to be usable in the metaverse, it is very important that it can recognize expressions, understand speech, and sense the user's emotions from speech. Whether it can produce highly expressive synthesized speech, and fine-grained movements and facial expressions, is also a major difficulty.

AI Technology Review: Can something like the "Grading Definition" take effect when drawn up by private parties? Or must it be approved by the state, with the relevant standards then formulated by the authorities?

Huang Minlie: The "Grading Definition" is not a standard. First of all, we want to discuss this question from an academic perspective and raise public awareness, and we also hope to offer the industry some systematic thinking about system development and research directions. At this stage we cannot say that the "Grading Definition" is a fixed standard; it is only a proposal or a guideline. We will have to do more work to turn it into a standard that everyone recognizes. This is a long-term process, and the release of the "Grading Definition" is only the first step toward the standardized, systematic development of AI dialogue systems.

AI Technology Review: As you said, what work is needed for the "AI Dialogue System Grading Definition" to be widely recognized and applied?

Huang Minlie: With the support of CCF (China Computer Federation), we plan to work with relevant research institutions and researchers to compile a white paper that reviews the development of AI dialogue systems and explains the purpose and criteria of the "Grading Definition" in detail.

In addition, we hope to promote a competition similar to the Amazon Alexa Prize competition, which is a long-term goal that requires financial support. We hope to create a unified development environment, unified data set, and unified testing framework to truly compare different dialogue systems. I know Baidu has similar ideas, but it's not open enough. We will unify the efforts of all parties in the future, with the goal of promoting the progress of dialogue system research, while also promoting industrial implementation and achieving some new developments in practical applications.
