Lightning Attention-2, a new-generation attention mechanism: unlimited sequence length, constant compute cost, higher modeling accuracy

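The sequence-length limit of today's large language models (LLMs) constrains their use in areas such as multi-turn dialogue, long-text understanding, and multimodal data processing and generation. The root cause is the Transformer architecture that most LLMs adopt: its computational complexity is quadratic in sequence length, so the demand for compute grows quadratically as sequences get longer. Handling long sequences efficiently has therefore been a persistent challenge for LLMs.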

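Earlier approaches focused mainly on letting an LLM adapt to longer sequences at inference time. One approach uses ALiBi or a similar relative positional encoding so that the model can adapt to inputs of different lengths. Another applies interpolation to RoPE or a similar relative positional encoding and briefly fine-tunes an already trained model to extend its sequence length. These methods give large models some long-sequence modeling ability, but they do not reduce the cost of training or inference.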

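The OpenNLPLab team has open-sourced a new linear attention mechanism called Lightning Attention-2, which aims to solve the long-sequence problem of LLMs. With it, training and inference on long sequences cost the same as on sequences of length 1K, which the team presents as a once-and-for-all solution. Until the GPU memory bottleneck is reached, increasing the sequence length does not slow down training, which makes pretraining on unlimited lengths possible. The per-token inference cost for very long text also stays the same as, or even lower than, the cost at 1K tokens, which greatly reduces the inference cost of current LLMs. As Figure 1 shows, at model sizes of 400M, 1B, and 3B, the training speed of LLaMA with FlashAttention2 starts to drop quickly as the sequence length grows, while the speed of TransNormerLLM with Lightning Attention-2 barely changes.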


Figure 1


Paper: Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models
Paper URL: https://arxiv.org/pdf/2401.04658.pdf
Code: https://github.com/OpenNLPLab/lightning-attention

Introduction to Lightning Attention-2

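Keeping the pretraining speed of a large model constant across sequence lengths sounds like an impossible task. Yet since linear attention appeared in 2020, researchers have worked to bring its practical efficiency in line with its theoretical linear complexity. Until mid-2023, research on linear attention focused mainly on matching the accuracy of the Transformer architecture, and improved linear attention mechanisms eventually became comparable to state-of-the-art Transformers in accuracy. However, the key computational trick of linear attention, switching from a left product to a right product, is in practice far slower than the direct left-product algorithm. The right product has to be implemented with a cumulative sum (cumsum) that involves many loop iterations, and the resulting volume of I/O operations makes it much less efficient than the left product. Keeping pretraining speed constant across sequence lengths therefore requires an implementation of linear attention that is more compute-efficient and performs far fewer I/O operations.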
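To make the left-product versus right-product distinction concrete, here is a minimal NumPy sketch for the causal (unidirectional) case. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and the attention is the plain unnormalized form without any decay or normalization terms. The right-product version shows the token-by-token cumulative sum that the paragraph above describes as slow in practice.

```python
import numpy as np

def causal_linear_attention_left(Q, K, V):
    """Left product: form the full N x N score matrix, mask it, then multiply by V."""
    n = Q.shape[0]
    mask = np.tril(np.ones((n, n)))          # lower-triangular matrix of ones
    return ((Q @ K.T) * mask) @ V            # O(N^2 * d) work

def causal_linear_attention_right(Q, K, V):
    """Right product: keep a running sum of k_t^T v_t and never form the N x N matrix."""
    n, d = Q.shape
    kv = np.zeros((d, V.shape[1]))           # cumulative sum of outer products
    out = np.empty((n, V.shape[1]))
    for t in range(n):                       # sequential loop: one small update per token
        kv += np.outer(K[t], V[t])
        out[t] = Q[t] @ kv
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((16, 4)) for _ in range(3))
left = causal_linear_attention_left(Q, K, V)
right = causal_linear_attention_right(Q, K, V)
max_diff = np.abs(left - right).max()        # the two forms agree up to rounding
```

Both functions return the same result; the difference is that the right product does O(N) work overall but as N tiny sequential steps, which is what makes a naive implementation I/O-bound on a GPU.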


Figure 2

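To better understand the idea behind Lightning Attention-2, first recall the formula for conventional softmax attention: O = softmax((QK^T) ⊙ M) V, where Q, K, V, M, and O are the query, key, value, mask, and output matrices. In unidirectional tasks (such as GPT), M is a lower-triangular matrix of ones; in bidirectional tasks (such as BERT) it can be ignored, that is, bidirectional tasks have no mask matrix.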
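As a point of reference, the following NumPy sketch computes causal softmax attention. It rests on two assumptions that go beyond the formula above: the function name is hypothetical, and the mask is applied in the way implementations commonly do it, by setting masked scores to negative infinity before the softmax so that they receive zero weight. Like the formula in the text, it omits the usual 1/sqrt(d) scaling.

```python
import numpy as np

def causal_softmax_attention(Q, K, V):
    """Softmax attention where position i may only attend to positions j <= i."""
    n = Q.shape[0]
    mask = np.tril(np.ones((n, n), dtype=bool))           # True where attention is allowed
    scores = np.where(mask, Q @ K.T, -np.inf)             # masked entries get zero weight
    scores = scores - scores.max(axis=1, keepdims=True)   # subtract row max for stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((6, 4)) for _ in range(3))
O, W = causal_softmax_attention(Q, K, V)
```

Because the softmax is applied row by row to the N x N score matrix, that matrix has to be formed explicitly, which is where the quadratic cost in sequence length comes from.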

The authors summarize the overall idea of Lightning Attention-2 in three points:

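1. One core idea of linear attention is to remove the computationally expensive softmax operator, so that attention can be written as O = ((QK^T) ⊙ M) V. In unidirectional tasks, however, the mask matrix M means this form can still only be computed as a left product, so O(N) complexity is out of reach. In bidirectional tasks there is no mask matrix, and the formula simplifies further to O = (QK^T) V. The elegance of linear attention is that, using nothing more than the associativity of matrix multiplication, this can be rewritten as O = Q (K^T V). This form is called the right product, and the former is called the left product. Figure 2 shows intuitively how linear attention reaches the attractive O(N) complexity in bidirectional tasks.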

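2. As decoder-only, GPT-style models have become the de facto standard for LLMs, using the right-product property of linear attention to accelerate unidirectional tasks has become an urgent problem. To solve it, the authors apply a divide-and-conquer strategy: they split the computation of the attention matrix into diagonal and off-diagonal parts and compute each in a different way. As Figure 3 shows, Lightning Attention-2 uses tiling, a common technique in computing, to split the Q, K, and V matrices into the same number of blocks. The computation within a block (intra-block) still uses the left product because of the mask matrix, with O(N^2) complexity. The computation between blocks (inter-block) involves no mask matrix, so it can use the right product and enjoy O(N) complexity. Once both are computed, they are added directly to obtain the linear attention output O_i for the i-th block. The KV state is accumulated with a cumsum for use in the next block's computation. The overall complexity of Lightning Attention-2 is therefore a trade-off between the O(N^2) intra-block part and the O(N) inter-block part, and the block size chosen for tiling determines how good that trade-off is.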

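3. Attentive readers will notice that the steps above are only the algorithmic part of Lightning Attention-2. The name "Lightning" reflects the fact that the authors also considered how efficiently the algorithm runs on GPU hardware. Inspired by the FlashAttention line of work, the actual GPU computation moves the tiled Q_i, K_i, and V_i tensors from the GPU's slower but larger HBM to its faster but smaller SRAM, which eliminates a large amount of memory I/O. Once the linear attention computation for a block is finished, its output O_i is written back to HBM. This process repeats until all blocks have been processed.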
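The three points above can be sketched in NumPy as follows. This is a simplified illustration, not the paper's Algorithm 1 or 2: the function name and the `block_size` parameter are assumptions, the sketch leaves out any decay or normalization terms the full algorithm may include, and it runs on the CPU, so the HBM-to-SRAM data movement from point 3 is only indicated in comments.

```python
import numpy as np

def blockwise_causal_linear_attention(Q, K, V, block_size):
    """Causal linear attention computed block by block (forward pass only)."""
    n, d = Q.shape
    assert n % block_size == 0, "sketch assumes the length is a multiple of the block size"
    intra_mask = np.tril(np.ones((block_size, block_size)))
    kv = np.zeros((d, V.shape[1]))            # accumulated K^T V of all earlier blocks
    out = np.empty((n, V.shape[1]))
    for start in range(0, n, block_size):
        sl = slice(start, start + block_size)
        Qi, Ki, Vi = Q[sl], K[sl], V[sl]      # on a GPU: load this tile from HBM into SRAM
        intra = ((Qi @ Ki.T) * intra_mask) @ Vi   # intra-block: masked left product
        inter = Qi @ kv                           # inter-block: right product, no mask
        out[sl] = intra + inter               # on a GPU: write O_i back to HBM
        kv += Ki.T @ Vi                       # update the KV state for the next block
    return out

rng = np.random.default_rng(0)
n, d = 32, 8
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

# Bidirectional case: associativity alone turns the left product into the right product.
bidirectional_diff = np.abs((Q @ K.T) @ V - Q @ (K.T @ V)).max()

# Unidirectional case: the tiled result matches the fully masked left product.
reference = ((Q @ K.T) * np.tril(np.ones((n, n)))) @ V
errors = {b: np.abs(blockwise_causal_linear_attention(Q, K, V, b) - reference).max()
          for b in (4, 8, 32)}
```

With `block_size` equal to the sequence length, the loop runs once and degenerates to the pure left product; smaller blocks shift more of the work to the linear inter-block path, which is the trade-off described in point 2.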

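Readers who want more detail can study Algorithm 1 and Algorithm 2 in the paper, along with its detailed derivations. Both the algorithms and the derivations treat the forward and backward passes of Lightning Attention-2 separately, which helps build a deeper understanding.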


Figure 3


Lightning Attention-2 accuracy comparison

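The researchers first compared the accuracy of Lightning Attention-2 and Lightning Attention-1 on a small (400M-parameter) model. As the accompanying figure shows, the two are nearly indistinguishable.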


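The researchers then compared, at 1B and 3B parameters and on the same corpus, TransNormerLLM with Lightning Attention-2 (TNL-LA2) against other advanced non-Transformer architectures and against LLaMA with FlashAttention2. As the accompanying figure shows, TNL-LA2 follows a trend similar to LLaMA's and reaches a lower loss. This experiment indicates that, in language modeling, Lightning Attention-2 is no less accurate than state-of-the-art Transformer architectures.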


On large language model benchmarks, the researchers compared TNL-LA2 15B with Pythia, a model of similar size. As the accompanying table shows, when trained on the same number of tokens, TNL-LA2 scores slightly higher than the softmax-attention-based Pythia on commonsense reasoning and on aggregated multiple-choice benchmarks.


Lightning Attention-2 speed comparison

The researchers compared the speed and memory usage of a single Lightning Attention-2 module against FlashAttention2 and Lightning Attention-1. As the accompanying figure shows, the runtime of Lightning Attention-2 grows strictly linearly with sequence length, unlike that of the other two. In memory usage, all three follow a similar trend, because the memory usage of FlashAttention2 and Lightning Attention-1 is also approximately linear, but Lightning Attention-2 has the smallest footprint.


Note that the paper focuses on the training speed of linear attention networks, making training on sequences of arbitrary length about as fast as training on 1K sequences. It says little about inference speed, because linear attention can be converted losslessly to an RNN mode at inference time, which achieves a similar effect: the time to generate a single token is constant. For a Transformer, by contrast, the inference speed for the current token depends on the number of tokens that precede it.
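The RNN mode mentioned above can be sketched as a small stateful decoder. This is a hypothetical illustration: the class name `LinearAttentionDecoderState` and its `step` method are assumptions, and the attention is again the plain unnormalized form. The point is that the state has a fixed size, so each decoding step costs the same regardless of how many tokens came before.

```python
import numpy as np

class LinearAttentionDecoderState:
    """Fixed-size recurrent state for token-by-token linear attention decoding."""

    def __init__(self, d_k, d_v):
        self.kv = np.zeros((d_k, d_v))       # size does not depend on sequence length

    def step(self, q, k, v):
        """Consume one token's (q, k, v) vectors and return its attention output."""
        self.kv += np.outer(k, v)            # fold the new token into the state
        return q @ self.kv                   # O(d_k * d_v) work per token

rng = np.random.default_rng(0)
n, d = 12, 4
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

state = LinearAttentionDecoderState(d, d)
decoded = np.stack([state.step(Q[t], K[t], V[t]) for t in range(n)])

# Same result as the parallel (training-time) masked left product.
parallel = ((Q @ K.T) * np.tril(np.ones((n, n)))) @ V
rnn_diff = np.abs(decoded - parallel).max()
```

A softmax-attention decoder would instead have to keep every earlier key and value in a cache and attend over all of them at each step, so its per-token cost grows with the context length.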

The authors compared the inference speed of TransNormerLLM-7B with Lightning Attention-1 against common 7B models. As the accompanying figure shows, at a similar parameter count its throughput is 4 times that of Baichuan and more than 3.5 times that of ChatGLM, a clear inference-speed advantage.


Summary

Lightning Attention-2 is a major advance in linear attention mechanisms. According to the authors, it can replace conventional softmax attention in both accuracy and speed, gives ever-larger future models a sustainable way to scale, and offers a more efficient way to process sequences of unlimited length. The OpenNLPLab team plans to study sequence-parallel algorithms based on linear attention to address the GPU memory bottleneck that remains.
