NetEase's open-source inference acceleration framework for transformer-based models supports high-performance single-GPU inference of models with tens of billions of parameters on mid- to low-end Ampere-architecture cards.
Transformer-based large-scale models have proven effective on a wide variety of tasks across many fields. However, applying them in industrial production requires considerable effort to reduce inference cost. To fill this gap, we propose a scalable inference solution: Easy and Efficient Transformer (EET). EET is a system that includes a series of Transformer inference optimizations at both the algorithm and implementation levels. By optimizing the Transformer's computation and data flow, EET significantly reduces inference cost and improves model efficiency and performance. Our experimental results show that EET substantially improves inference speed and resource utilization without any loss of model accuracy, providing a simple and effective solution for deploying large-scale models in industrial production.
First, we design a highly optimized kernel for long inputs and large hidden sizes.
In addition, we propose a flexible CUDA memory manager that reduces the memory footprint when deploying large models. Compared with the state-of-the-art Transformer inference library Faster Transformer (v4.0), EET achieves an average 1.40x-4.20x acceleration on the decoding layer on an A100 GPU.
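To make the memory-manager idea concrete, here is a minimal sketch of a caching buffer pool in the spirit of the flexible CUDA memory manager described above. All names are illustrative, not EET's actual API, and a plain `bytearray` stands in for device memory: buffers are requested by name, allocated grow-only, and reused across all transformer layers, so peak memory is bounded by the largest request per buffer rather than growing with the number of layers.

```python
class BufferManager:
    """Illustrative grow-only buffer pool (not EET's real implementation)."""

    def __init__(self):
        # name -> bytearray; a real manager would hold CUDA device pointers.
        self._pool = {}

    def request(self, name, nbytes):
        """Return a buffer of at least nbytes, reusing a cached one if large enough."""
        buf = self._pool.get(name)
        if buf is None or len(buf) < nbytes:
            buf = bytearray(nbytes)  # grow-only: never shrink, never free per-layer
            self._pool[name] = buf
        return memoryview(buf)[:nbytes]

    def total_bytes(self):
        """Total memory held by the pool (the peak footprint)."""
        return sum(len(b) for b in self._pool.values())


mgr = BufferManager()
# 24 decoder layers share one activation workspace instead of 24 allocations.
for layer in range(24):
    act = mgr.request("ffn_activation", 4096 * 4)
```

Because each named workspace is allocated once and recycled layer by layer, the footprint here stays at a single 16 KiB buffer regardless of depth; the same principle is what lets a large model fit on a single mid-range card.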
https://arxiv.org/abs/2104.12470
https://github.com/NetEase-FuXi/EET