Implementing Tsinghua UltraChat multi-round conversations using multiple ChatGPT APIs-AI-php.cn

Since the release of ChatGPT, the popularity of conversation models has only increased during this period. While we admire the amazing performance of these models, we should also guess the huge computing power and massive data support behind them.

As far as data is concerned, high-quality data is crucial, and for this reason OpenAI has put a lot of effort into data and annotation work. Multiple studies have shown that ChatGPT is a more reliable data annotator than humans. If the open source community can obtain large amounts of dialogue data from powerful language models such as ChatGPT, it can train dialogue models with better performance. This is proven by the Alpaca family of models – Alpaca, Vicuna, Koala. For example, Vicuna replicated ChatGPT’s nine-step success by fine-tuning instructions for the LLaMA model using user sharing data collected from ShareGPT. Increasing evidence shows that data is the primary productivity for training powerful language models.

ShareGPT is a ChatGPT data sharing website where users upload ChatGPT answers they find interesting. The data on ShareGPT is open but trivial and needs to be collected and organized by researchers themselves. If there is a high-quality, wide-ranging data set, the open source community will get twice the result with half the effort in developing conversation models.

Based on this, a recent project called UltraChat systematically constructed an ultra-high-quality conversation data set. The project authors tried to use two independent ChatGPT Turbo APIs to conduct conversations to generate multiple rounds of conversation data.

调用多个ChatGPT API相互对话，清华开源的多轮对话数据UltraChat来了

## Project address: https://github.com/thunlp/UltraChat
Dataset address: http://39.101.77.220/
Dataset interaction address: https://atlas. nomic.ai/map/0ce65783-c3a9-40b5-895d-384933f50081/a7b46301-022f-45d8-bbf4-98107eabdbac

Specifically, the project aims to We are building an open source, large-scale, multi-round dialogue data based on Turbo APIs to facilitate researchers to develop powerful language models with universal dialogue capabilities. In addition, taking into account privacy protection and other factors, the project will not directly use data on the Internet as prompts. In order to ensure the quality of the generated data, the researchers used two independent ChatGPT Turbo APIs in the generation process, in which one model plays the role of the user to generate questions or instructions, and the other model generates feedback.

调用多个ChatGPT API相互对话，清华开源的多轮对话数据UltraChat来了

If you directly use ChatGPT to generate it freely based on some seed conversations and questions, it is prone to problems such as single topics and repeated content, making it difficult to guarantee data. diversity itself. To this end, UltraChat has systematically classified and designed the topics and task types covered by the conversation data, and also conducted detailed prompt engineering for the user model and reply model, which consists of three parts:

Questions about the World: This part of the conversation comes from broad inquiries about concepts, entities, and objects in the real world. The topics covered cover technology, art, finance and other fields.
Writing and Creation: This part of the dialogue data focuses on instructing the AI to create a complete text material from scratch, and based on this, follow-up questions or further guidance To improve your writing, content types include articles, blogs, poems, stories, plays, emails, and more.
Assisted rewriting (Writing and Creation) of existing data: The dialogue data is generated based on existing data. Instructions include but are not limited to rewriting, continuation, translation, induction, reasoning, etc., and the topics covered are also very diverse.

These three parts of data cover most users’ requirements for AI models. At the same time, these three types of data will also face different challenges and require different construction methods.

For example, the main challenge of the first part of the data is how to cover common knowledge in human society as widely as possible in a total of hundreds of thousands of conversations. To this end, the researchers used automatically generated topics and sources from Wikidata Two aspects of entities are filtered and constructed.

The challenges in the second and third parts mainly come from how to simulate user instructions and make the generation of user models as diverse as possible in subsequent conversations without deviating from the ultimate goal of the conversation ( Generate materials or rewrite materials as required), for which the researchers have fully designed and experimented with the input prompts of the user model. After the construction was completed, the authors also post-processed the data to weaken the hallucination problem.

Currently, the project has released the first two parts of the data, with a data volume of 1.24 million, which should be the largest related data set in the open source community. The content contains rich and colorful conversations in the real world, and the final part of the data will be released in the future.

World problem data comes from 30 representative and diverse meta-themes, as shown in the figure below:

调用多个ChatGPT API相互对话，清华开源的多轮对话数据UltraChat来了

Based on the above meta-themes, this project generated 1100 sub-themes for data construction;
For each sub-theme , generate up to 10 specific questions;
Then use the Turbo API to generate new related questions for each of the 10 questions;
For each question, the two models are iteratively used to generate 3 to 7 dialogue rounds as described above.

Additionally, this project collected the 10,000 most commonly used named entities from Wikidata; used the ChatGPT API to generate 5 meta-questions for each entity; for each meta Questions, 10 more specific questions and 20 related but general questions were generated; 200,000 specific questions, 250,000 general questions and 50,000 meta-questions were sampled, and 3~7 dialogue rounds were generated for each question.

Next let’s look at a specific example:

调用多个ChatGPT API相互对话，清华开源的多轮对话数据UltraChat来了

We tested the data on the UltraChat platform Search results. For example, if you enter "music", the system will automatically search for 10,000 sets of music-related ChatGPT conversation data, and each set is a multi-round conversation

调用多个ChatGPT API相互对话，清华开源的多轮对话数据UltraChat来了

The search results for entering the keyword "mathematics (math)", there are 3346 groups of multi-round conversations:

调用多个ChatGPT API相互对话，清华开源的多轮对话数据UltraChat来了

Currently, UltraChat covers There are already many information fields, including medical, education, sports, environmental protection and other topics. At the same time, the author tried to use the open source LLaMa-7B model to perform supervised instruction fine-tuning on UltraChat, and found that after only 10,000 steps of training, there were very impressive effects. Some examples are as follows:

调用多个ChatGPT API相互对话，清华开源的多轮对话数据UltraChat来了

World Knowledge: List 10 good Chinese and American universities respectively

调用多个ChatGPT API相互对话，清华开源的多轮对话数据UltraChat来了

Imagination question: What are the possible consequences after space travel becomes possible?

调用多个ChatGPT API相互对话，清华开源的多轮对话数据UltraChat来了

Syllogism: Is a whale a fish?

调用多个ChatGPT API相互对话，清华开源的多轮对话数据UltraChat来了

##Hypothetical question: Prove that Jackie Chan is better than Bruce Lee

调用多个ChatGPT API相互对话，清华开源的多轮对话数据UltraChat来了

Overall, UltraChat is a high-quality, wide-ranging ChatGPT conversation data set that can be combined with other data sets to significantly improve the quality of open source conversation models. At present, UltraChat only releases the English version, but it will also release the Chinese version of the data in the future. Interested readers are welcome to explore it.

The above is the detailed content of Implementing Tsinghua UltraChat multi-round conversations using multiple ChatGPT APIs. For more information, please follow other related articles on the PHP Chinese website!