Automatic summarization has made significant progress in recent years, largely thanks to a paradigm shift: where the field once relied on supervised fine-tuning over annotated datasets, it now leans on zero-shot prompting of large language models (LLMs) such as GPT-4. With careful prompt design, properties such as summary length, topic, and style can be controlled without any additional training.
One aspect, however, is often overlooked: the information density of the summary. In principle, a summary compresses another text, so it should be denser, packing more information into each token, than the source document. Given the high latency of LLM decoding, covering more information with fewer words also matters in practice, especially for real-time applications.
How dense a summary should be, however, is an open question: if it carries too little detail, it is of little use; if it crams in too much information without growing in length, it becomes hard to read. Conveying more information within a fixed token budget requires a combination of abstraction, compression, and fusion.
In a recent study, researchers from Salesforce, MIT, and elsewhere tried to locate the limit of increasing density by soliciting human preferences over a set of summaries generated by GPT-4. The method offers plenty of inspiration for improving the expressive power of large language models such as GPT-4.
Paper link: https://arxiv.org/pdf/2309.04269.pdf
Dataset address: https://huggingface.co/datasets/griffin/chain_of_density
Specifically, the researchers use the average number of entities per token as a proxy for density and first generate an initial, entity-sparse summary. They then iteratively identify 1-3 entities missing from the previous summary and fuse them in without increasing the overall length (five times in total). Each resulting summary has a higher entity-to-token ratio than the one before it. Based on the collected human preference data, the authors conclude that people prefer summaries that are nearly as dense as human-written ones, and denser than those produced by an ordinary GPT-4 prompt. The contributions of the study can be summarized as follows:
The authors designed a single Chain of Density (CoD) prompt that generates an initial summary and then makes its entity density increase step by step: within a fixed number of iterations, a set of salient entities unique to the source text is identified and merged into the previous summary without increasing its length.
Examples of the prompt and its output are shown in Figure 2. The authors do not explicitly specify entity types, but define a missing entity as one that is relevant to the main story, specific yet concise, novel (not already in the previous summary), faithful (present in the article), and possibly located anywhere in the article.
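To make the procedure concrete, here is a minimal sketch of a CoD-style densification loop. The prompt text paraphrases the paper's description rather than quoting Figure 2, the paper drives all five steps from a single prompt whereas this sketch issues one call per step, and `call_llm` is a placeholder for whatever chat-completion client is available, so all of these details should be read as assumptions.

```python
# Hypothetical sketch of a Chain of Density (CoD) style loop. The prompt wording
# paraphrases the paper's description and is NOT the exact prompt from Figure 2;
# `call_llm` is a placeholder for any chat-completion client.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")

def chain_of_density(article: str, steps: int = 5) -> list[str]:
    """Return one summary per step, each denser than the last but no longer."""
    summaries: list[str] = []
    summary = call_llm(
        "Write an initial, entity-sparse summary (roughly 80 words) of the "
        f"following article:\n\n{article}"
    )
    summaries.append(summary)
    for _ in range(steps - 1):
        summary = call_llm(
            "Identify 1-3 informative entities from the article that are missing "
            "from the previous summary, then rewrite the summary to include them "
            "WITHOUT increasing its length, using fusion and compression to make "
            f"room.\n\nArticle:\n{article}\n\nPrevious summary:\n{summary}"
        )
        summaries.append(summary)
    return summaries
```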
The authors randomly selected 100 articles from the CNN/DailyMail summarization test set and generated CoD summaries for them. For reference, they compared CoD summary statistics with human-written bullet-point reference summaries and with summaries generated by GPT-4 under a vanilla prompt: "Write a very short summary of the article. No more than 70 words."
Statistics. In the study, the authors report two kinds of statistics: direct and indirect. Direct statistics (tokens, entities, entity density) are controlled directly by CoD, while indirect statistics are an expected by-product of densification.
Direct statistics. As shown in Table 1, the second step reduces the length by an average of 5 tokens (from 72 to 67) by removing unnecessary words from the initially verbose summary. Entity density starts at 0.089, initially below both the human-written and vanilla GPT-4 summaries (0.151 and 0.122), and rises to 0.167 after five densification steps.
Indirect statistics. Abstractiveness should increase with each CoD step, since the summary is repeatedly rewritten to make room for each additional entity; the authors measure it via extractive density, the average squared length of extractive fragments (Grusky et al., 2018). Likewise, fusion should increase monotonically as entities are added to a fixed-length summary; the authors quantify it as the average number of source sentences aligned to each summary sentence. For alignment, they use the relative ROUGE gain method (Zhou et al., 2018), which keeps aligning source sentences to a target sentence until the relative ROUGE gain of an additional sentence is no longer positive. They also expected the content distribution, i.e., the position within the article from which the summary content is drawn, to shift.
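As an illustration of the direct statistics, the snippet below computes entity density as named entities per token using spaCy. The paper does not pin down its exact tokenizer or NER pipeline, so the choice of spaCy and the `en_core_web_sm` model here is an assumption made for illustration.

```python
# Sketch of the entity-density statistic (named entities per token).
# The choice of spaCy and the en_core_web_sm model is an assumption;
# the paper does not specify its tokenizer or NER pipeline.
import spacy

nlp = spacy.load("en_core_web_sm")

def entity_density(text: str) -> float:
    doc = nlp(text)
    return len(doc.ents) / len(doc) if len(doc) else 0.0

# A denser summary packs more entities into the same token budget.
sparse = "The match, which took place on Tuesday, ended with a heavy home win."
dense = "Liverpool beat Barcelona 4-0 at Anfield on Tuesday, reaching the final."
print(entity_density(sparse), entity_density(dense))
```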
Specifically, the authors expected CoD summaries to exhibit a strong "lead bias" at first and then gradually pull in entities from the middle and end of the article. To measure this, they reused the alignments from the fusion analysis and computed the average sentence rank of all aligned source sentences.
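The indirect statistics can be sketched in the same spirit. The snippet below implements a greedy alignment in the spirit of the relative ROUGE gain method (Zhou et al., 2018) and derives the fusion and content-distribution numbers from it; `rouge_f1` is a placeholder for any ROUGE scorer (e.g. the rouge-score package), and the details are assumptions rather than the paper's exact implementation.

```python
# Greedy source-sentence alignment in the spirit of relative ROUGE gain
# (Zhou et al., 2018), plus the two derived statistics used above: fusion
# (aligned source sentences per summary sentence) and content distribution
# (mean rank of aligned source sentences). `rouge_f1` is a placeholder.

def rouge_f1(hypothesis: str, reference: str) -> float:
    raise NotImplementedError("plug in a ROUGE implementation here")

def align(summary_sentence: str, source_sentences: list[str]) -> list[int]:
    """Add source sentences greedily while the relative ROUGE gain stays positive."""
    chosen: list[int] = []
    best = 0.0
    while True:
        candidates = []
        for i, _ in enumerate(source_sentences):
            if i in chosen:
                continue
            text = " ".join(source_sentences[j] for j in chosen + [i])
            candidates.append((rouge_f1(text, summary_sentence) - best, i))
        if not candidates:
            break
        gain, idx = max(candidates)
        if gain <= 0:
            break
        chosen.append(idx)
        best += gain
    return chosen

def fusion_and_mean_rank(summary_sents: list[str], source_sents: list[str]):
    alignments = [align(s, source_sents) for s in summary_sents]
    fusion = sum(len(a) for a in alignments) / len(alignments)
    ranks = [i + 1 for a in alignments for i in a]
    mean_rank = sum(ranks) / len(ranks) if ranks else 0.0
    return fusion, mean_rank
```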
Figure 3 confirms these hypotheses: as the number of rewriting steps increases, abstractiveness rises (lower extractive density, left panel), the fusion rate rises (middle panel), and the summary begins to incorporate content from the middle and end of the article (right panel). Interestingly, all CoD summaries are more abstractive than both the human-written and the baseline summaries.
To better understand the trade-offs of CoD summaries, the authors ran a preference-based human study and a rating-based evaluation with GPT-4.
Human preferences. For the same 100 articles (5 steps × 100 = 500 summaries in total), the randomly shuffled CoD summaries, together with the articles, were shown to the first four authors of the paper. Each annotator picked their favorite summary based on Stiennon et al.'s (2020) definition of a "good summary." Table 2 reports the first-place votes by CoD step, per annotator and in aggregate. Overall, 61% of first-place votes (23.0 + 22.5 + 15.5) went to summaries involving three or more densification steps. The median preferred CoD step is the middle one (3), and the expected preferred step is 3.06.
Based on the average density around the third step, the preferred entity density across the CoD candidates is approximately 0.15. As Table 1 shows, this is consistent with human-written summaries (0.151) and significantly higher than summaries produced by the vanilla GPT-4 prompt (0.122).
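The aggregation behind these numbers is simple expected-value arithmetic over the first-place vote shares: the expected step is the share-weighted average of the step indices, and the preferred density is the same average taken over the per-step densities from Table 1. A generic helper is sketched below; the full per-step vote shares and densities are not reproduced in this article, so no concrete numbers are plugged in.

```python
# Expected-value aggregation over first-place vote shares. Plugging in the
# per-step vote shares and the per-step entity densities from Table 1 (not
# reproduced here) yields the expected preferred step of 3.06 and the
# preferred density of roughly 0.15 reported in the paper.

def expected_value(values: list[float], shares: list[float]) -> float:
    """Share-weighted average: sum_k values[k] * shares[k] / sum(shares)."""
    total = sum(shares)
    return sum(v * s for v, s in zip(values, shares)) / total

# expected_step = expected_value([1, 2, 3, 4, 5], first_place_shares)
# expected_density = expected_value(step_densities, first_place_shares)
```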
Automatic metrics. As a complement to the human evaluation, the authors used GPT-4 to score CoD summaries (1-5) along five dimensions: informativeness, quality, coherence, attributability, and overall preference. As shown in Table 3, density correlates with informativeness, but only up to a point: the score peaks at step 4 (4.74).
Averaged across all dimensions, the first and last CoD steps score lowest, while the middle three steps score close to one another (4.78, 4.77, and 4.76).
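A minimal sketch of this kind of LLM-as-judge scoring is shown below. The exact wording of the paper's scoring prompt and its dimension definitions are not reproduced here, so the prompt text and the `call_llm` helper should be read as assumptions for illustration.

```python
# Hypothetical sketch of GPT-4-based 1-5 scoring along the five dimensions
# discussed above. The prompt wording is NOT the paper's exact prompt, and
# `call_llm` is a placeholder for any chat-completion client.

DIMENSIONS = [
    "informativeness", "quality", "coherence", "attributability", "overall preference",
]

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")

def score_summary(article: str, summary: str) -> dict[str, int]:
    scores: dict[str, int] = {}
    for dim in DIMENSIONS:
        reply = call_llm(
            f"On a scale of 1 (worst) to 5 (best), rate the {dim} of the summary "
            "below with respect to the article. Answer with a single integer.\n\n"
            f"Article:\n{article}\n\nSummary:\n{summary}"
        )
        scores[dim] = int(reply.strip())
    return scores
```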
Qualitative analysis. There is a clear trade-off between the coherence/readability of a summary and its informativeness. Figure 4 shows two CoD steps: one whose summary is improved by the added detail, and one whose summary is harmed by it. On average, the intermediate CoD summaries strike the best balance, but precisely defining and quantifying this trade-off is left to future work.
For more details, please refer to the original paper.