The open LLM community is in an era where a hundred flowers bloom and contend. There are Llama-3-70B-Instruct, Qwen2-72B-Instruct, Nemotron-4-340B-Instruct, Mixtral-8x22B-Instruct-v0.1 and many other strong models. However, compared with proprietary large models represented by GPT-4-Turbo, open models still lag significantly in many areas.
In addition to general-purpose models, open models that specialize in key areas have also been developed, such as DeepSeek-Coder-V2 for programming and mathematics and InternVL 1.5 for vision-language tasks (which is comparable to GPT-4-Turbo-2024-04-09 in some areas).
As the "shovel king of the AI gold rush", NVIDIA is itself contributing to the field of open models, for example with the ChatQA series of models it developed. See this site's earlier report "NVIDIA's new dialogue QA model is more accurate than GPT-4, but was criticized: code without weights has little meaning". Earlier this year, ChatQA 1.5 was released; it integrates retrieval-augmented generation (RAG) technology and outperforms GPT-4 in conversational question answering.
Now, ChatQA has evolved to version 2.0. The main direction of improvement this time is to expand the context window.
Paper title: ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities
Paper address: https://arxiv.org/pdf/2407.14482
Recently, extending the context window length of LLMs has been a major research and development hotspot. For example, this site previously reported "Directly expand to infinite length, Google Infini-Transformer ends the context length debate".
All leading proprietary LLMs support very large context windows, so you can feed them hundreds of pages of text in a single prompt. For example, the context window sizes of GPT-4 Turbo and Claude 3.5 Sonnet are 128K and 200K tokens respectively, and Gemini 1.5 Pro can support a context of up to 10M tokens, which is remarkable.
Open-source large models are catching up, though. For example, Qwen2-72B-Instruct and Yi-34B support 128K and 200K context windows respectively. However, the training data and technical details of these models are not publicly available, making them difficult to reproduce. In addition, these models are mostly evaluated on synthetic tasks, which do not accurately represent performance on real downstream tasks. Indeed, multiple studies have shown that there is still a significant gap between open LLMs and leading proprietary models on real-world long context understanding tasks.
The NVIDIA team has now managed to bring the open Llama-3 up to the level of the proprietary GPT-4 Turbo on real-world long context understanding tasks.
In the LLM community, long context capability is sometimes seen as a technology that competes with RAG. Realistically, though, the two can complement each other.
For an LLM with a long context window, depending on the downstream task and the trade-off between accuracy and efficiency, you can either attach a large amount of text to the prompt or use retrieval to efficiently extract the relevant information from a large amount of text. RAG has a clear efficiency advantage and can easily retrieve relevant information from billions of tokens for query-based tasks, which long context models cannot do. Long context models, on the other hand, are very good at tasks such as document summarization, which RAG may not handle well.
Therefore, an advanced LLM needs both capabilities, so that either can be chosen according to the downstream task and the accuracy and efficiency requirements.
Previously, NVIDIA’s open-source ChatQA 1.5 model was already able to outperform GPT-4-Turbo on RAG tasks. But the team didn't stop there. They have now open-sourced ChatQA 2, which also integrates long context understanding capabilities comparable to GPT-4-Turbo!
Specifically, they start from the Llama-3 model and extend its context window to 128K (on par with GPT-4-Turbo), while also equipping it with the best long context retriever currently available.
Expand the context window to 128K
So, how did NVIDIA increase the context window of Llama-3 from 8K to 128K? First, they prepared a long context pre-training corpus based on SlimPajama, using the method from the paper "Data engineering for scaling language models to 128k context" by Fu et al. (2024).
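They also made an interesting discovery during training: separating different documents with a special character such as <s> works better than using the original start and end tokens <BOS> and <EOS>. They speculate that the reason is that the <BOS> and <EOS> tokens in Llama-3 signal the model to ignore the preceding text after pre-training, which does not help an LLM adapt to longer context inputs.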
Using long context data for instruction fine-tuning
The team also designed an instruction fine-tuning method that can simultaneously improve the model's long context understanding capabilities and RAG performance.
Specifically, this instruction fine-tuning method is divided into three stages. The first two stages are the same as in ChatQA 1.5: the model is first trained on a 128K-sample high-quality instruction-following dataset, and then on a mixture of conversational QA data with provided context. However, the contexts involved in both stages are relatively short, with a maximum sequence length of no more than 4K tokens. To increase the model's context window to 128K tokens, the team collected a long supervised fine-tuning (SFT) dataset.
They collected it in two ways:
1. For SFT sequences shorter than 32K: they used existing long context datasets, namely LongAlpaca12k, GPT-4 samples from Open Orca, and Long Data Collections.
2. For sequences between 32K and 128K: because such SFT samples are hard to collect, they turned to synthetic data. They used NarrativeQA, which contains both ground-truth summaries and semantically related paragraphs. They assembled all the related paragraphs and randomly inserted the ground-truth summary to simulate a real long document for its question-answer pairs.
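The synthetic construction in the second item can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `build_synthetic_long_sample`, the dictionary fields, and the placeholder texts are hypothetical and do not come from the paper or from the NarrativeQA release.

```python
import random


def build_synthetic_long_sample(paragraphs, summary, question, answer, seed=None):
    """Join related paragraphs into one long document and insert the
    ground-truth summary at a random paragraph boundary.

    All names here are hypothetical; the team's real pipeline is not public
    in this article.
    """
    rng = random.Random(seed)
    pieces = list(paragraphs)
    # 0 means "before the first paragraph", len(pieces) means "after the last".
    position = rng.randint(0, len(pieces))
    pieces.insert(position, summary)
    return {
        "context": "\n\n".join(pieces),
        "question": question,
        "answer": answer,
        "summary_position": position,
    }


# Placeholder data standing in for one NarrativeQA story and its QA pair.
paragraphs = [f"Paragraph {i} of the story." for i in range(5)]
sample = build_synthetic_long_sample(
    paragraphs,
    summary="SUMMARY: a short ground-truth summary.",
    question="Who is the main character?",
    answer="A placeholder answer.",
    seed=0,
)
```

Because the summary lands at a random position, the answer-bearing text is not always at the start or end of the document, which is presumably the point of inserting it randomly.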
Then the long SFT dataset is combined with the short SFT data from the first two stages, and the model is trained on the combined data. Here the learning rate is set to 3e-5 and the batch size is 32.
Long context retriever meets long context LLM
There are some problems with the RAG pipeline currently used with LLMs:
1. Top-k chunk-wise retrieval fragments the context to a non-negligible degree, which hurts the generation of accurate answers. For example, previous state-of-the-art dense embedding-based retrievers only supported 512 tokens.
2. A small top-k (such as 5 or 10) leads to relatively low recall, while a large top-k (such as 100) leads to poor generation results, because previous LLMs could not make good use of so many chunked contexts.
To solve these problems, the team proposes using a recent long context retriever that supports thousands of tokens. Specifically, they chose the E5-mistral embedding model as the retriever.
Table 1 compares top-k retrieval for different chunk sizes and total numbers of tokens in the context window.
Comparing total token counts from 3000 to 12000, the team found that more tokens gave better results, which confirms that the new model's long context capability is indeed good. They also found that a total of 6000 tokens offers a good trade-off between cost and performance. With the total fixed at 6000 tokens, larger chunks gave better results. Therefore, the default setting in their experiments is a chunk size of 1200 with the top-5 chunks.
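To make the chunk-size and top-k trade-off concrete, here is a minimal sketch of chunked top-k retrieval under a token budget. It rests on loud assumptions: whitespace splitting stands in for a real tokenizer, and a bag-of-words cosine score stands in for the E5-mistral embedding model, so it illustrates the mechanics only, not the team's retriever. The function names are hypothetical.

```python
import math
from collections import Counter


def chunk_tokens(tokens, chunk_size):
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(document, query, chunk_size=1200, top_k=5):
    """Return the top_k chunks by similarity to the query, best first."""
    tokens = document.split()  # assumption: whitespace tokens, not model tokens
    chunks = chunk_tokens(tokens, chunk_size)
    query_vec = Counter(query.lower().split())
    scored = [
        (cosine(Counter(t.lower() for t in chunk), query_vec), index)
        for index, chunk in enumerate(chunks)
    ]
    # Highest score first; ties are broken by document order.
    best = sorted(scored, key=lambda item: (-item[0], item[1]))[:top_k]
    return [" ".join(chunks[index]) for _, index in best]


# The default setting from the article: 1200-token chunks, top-5.
budget = 1200 * 5  # total tokens passed to the LLM

doc = " ".join(["filler"] * 3000 + ["the", "needle", "is", "blue"] + ["filler"] * 3000)
hits = retrieve_top_k(doc, "needle blue", chunk_size=1200, top_k=5)
```

With these defaults the retrieved text never exceeds `budget` tokens. Other pairs of chunk size and top-k that multiply to the same budget can be passed in through the same two parameters to compare them.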
Experiments
Evaluation benchmarks
In order to conduct a comprehensive evaluation and analyze different context lengths, the team used three types of evaluation benchmarks:
1. Long context benchmarks, more than 100K tokens;
2. Medium-long context benchmarks, less than 32K tokens;
3. Short context benchmarks, less than 4K tokens.
RAG is applied whenever it is applicable to the downstream task.
Results
The team first conducted a Needle in a Haystack test based on synthetic data, and then tested the model’s real-world long context understanding and RAG capabilities.
1. Needle in a haystack test
Can Llama3-ChatQA-2-70B find the target needle in a sea of text? This is a synthetic task commonly used to test the long context ability of LLMs, and it can be seen as a threshold-level check for an LLM. Figure 1 shows the performance of the new model within 128K tokens: its accuracy reaches 100%. This test confirms that the new model has perfect long context retrieval capability.
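For readers unfamiliar with the setup, a needle-in-a-haystack test can be sketched as below. This is a generic illustration, not the team's evaluation harness: the filler sentence, the needle, the prompt template, and `toy_model` (a hypothetical stand-in for a real LLM call) are all assumptions.

```python
def build_haystack(filler_sentence, needle, total_sentences, depth):
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)."""
    sentences = [filler_sentence] * total_sentences
    index = round(depth * total_sentences)
    sentences.insert(index, needle)
    return " ".join(sentences)


def run_needle_test(model_fn, needle, question, expected, lengths, depths):
    """Return the accuracy of model_fn over a grid of lengths and depths."""
    filler = "The grass is green and the sky is blue."
    correct = 0
    total = 0
    for length in lengths:
        for depth in depths:
            context = build_haystack(filler, needle, length, depth)
            prompt = f"{context}\n\nQuestion: {question}\nAnswer:"
            total += 1
            if expected.lower() in model_fn(prompt).lower():
                correct += 1
    return correct / total


def toy_model(prompt):
    """Hypothetical stand-in for an LLM: scans the prompt for the needle."""
    marker = "The secret number is "
    start = prompt.find(marker)
    if start == -1:
        return ""
    return prompt[start + len(marker):].split(".")[0]


accuracy = run_needle_test(
    toy_model,
    needle="The secret number is 7421.",
    question="What is the secret number?",
    expected="7421",
    lengths=[10, 100, 1000],
    depths=[0.0, 0.25, 0.5, 0.75, 1.0],
)
```

In a real run, `toy_model` would be replaced by a call to the model under test, and `lengths` would be chosen so that the prompts span the context window being evaluated.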
2. Long context evaluation over 100K tokens
On real-world tasks from InfiniteBench, the team evaluated the model’s performance when the context length exceeded 100K tokens. The results are shown in Table 2.
It can be seen that the new model performs better than many of the current best models, such as GPT-4-Turbo-2024-04-09 (33.16), GPT-4-1106-preview (28.23), Llama-3-70B-Instruct-Gradient-262k (32.57) and Claude 2 (33.96). In addition, the new model's score is very close to the highest score of 34.88, obtained by Qwen2-72B-Instruct. Overall, NVIDIA's new model is quite competitive.
3. Medium-long context evaluation within 32K tokens
Table 3 shows the performance of each model when the number of tokens in the context is within 32K.
As you can see, GPT-4-Turbo-2024-04-09 has the highest score, 51.93. The new model scores 47.37, which is higher than Llama-3-70B-Instruct-Gradient-262k but lower than Qwen2-72B-Instruct. A possible reason is that the pre-training of Qwen2-72B-Instruct makes heavy use of 32K-token sequences, while the continued pre-training corpus used by the team is much smaller. Furthermore, they found that all RAG solutions performed worse than the long context solutions, indicating that all of these state-of-the-art long context LLMs can handle 32K tokens within their context window.
4. ChatRAG Bench: short context evaluation within 4K tokens
On ChatRAG Bench, the team evaluated the performance of the model when the context length is less than 4K tokens, see Table 4.
The new model's average score is 54.81. Although this is not as good as Llama3-ChatQA-1.5-70B, it is still better than GPT-4-Turbo-2024-04-09 and Qwen2-72B-Instruct. This supports the point that extending a short context model into a long context model comes at a cost. It also points to a research direction worth exploring: how can the context window be extended further without affecting performance on short context tasks?
5. Comparing RAG with long context
Tables 5 and 6 compare the performance of RAG with long context solutions at different context lengths. When the sequence length exceeds 100K, only the average scores for En.QA and En.MC are reported, because RAG is not directly applicable to En.Sum and En.Dia.
It can be seen that the newly proposed long context solution outperforms RAG when the sequence length of the downstream task is less than 32K. This means that using RAG results in cost savings, but at the expense of accuracy.
On the other hand, RAG (top-5 for Llama3-ChatQA-2-70B and top-20 for Qwen2-72B-Instruct) outperforms the long context solution when the context length exceeds 100K. This means that even the current best long context LLMs may struggle to understand and reason effectively over contexts of this length. The team recommends using RAG whenever possible in this case, because it brings both higher accuracy and lower inference cost.
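Taken together, these two findings suggest a simple routing rule. The sketch below is one possible reading of them, not something the team published: the function name, the exact numeric thresholds (32K and 100K rounded to 32,000 and 100,000 tokens), and the "benchmark_both" fallback for lengths the article does not cover are all assumptions.

```python
SHORT_LIMIT = 32_000   # assumption: "less than 32K" rounded to 32,000 tokens
LONG_LIMIT = 100_000   # assumption: "exceeds 100K" rounded to 100,000 tokens


def choose_strategy(context_tokens, rag_applicable=True):
    """Pick between feeding the full context and using RAG.

    rag_applicable should be False for tasks such as summarization,
    where retrieval does not directly apply.
    """
    if not rag_applicable:
        return "long_context"
    if context_tokens < SHORT_LIMIT:
        # Long context scored higher than RAG in this range.
        return "long_context"
    if context_tokens > LONG_LIMIT:
        # RAG scored higher and is cheaper in this range.
        return "rag"
    # The article reports no comparison between 32K and 100K.
    return "benchmark_both"
```

For example, `choose_strategy(8_000)` selects the long context path, while `choose_strategy(120_000)` selects RAG. Below 32K, a deployment that values cost over accuracy might still prefer RAG, since the article notes that RAG saves cost there at the expense of accuracy.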