Graph optimization plays an important role in reducing the time and resources needed to train and run AI models. One of its key functions is to fuse operators in the model wherever they can be fused, which improves computing efficiency by reducing memory usage and data movement to and from slow memory. However, a back-end that supports a wide variety of operator fusions is very difficult to implement, so the fusions actually available to AI models on real hardware are very limited.
The Composable Kernel (CK) library aims to provide a back-end solution for operator fusion on AMD GPUs. CK is written in a general-purpose programming language, HIP C++, and is completely open source. To improve the productivity of back-end developers, its design introduces two concepts:
1. Tensor Coordinate Transformation, introduced to reduce the complexity of writing AI operators. This work pioneered the definition of a set of reusable Tensor Coordinate Transformation basic modules and uses them to re-express complex AI operators (such as convolution, group normalization reduction, Depth2Space, etc.) in a mathematically rigorous way as the most basic AI operators (GEMM, 2D reduction, tensor transfer, etc.). As a result, an algorithm written for a basic AI operator can be used directly on all corresponding complex AI operators without being rewritten.
2. Tile-based programming paradigm: Developing the back-end algorithm for a fused operator can be seen as first breaking each pre-fusion (independent) operator down into many "small block" data operations, and then combining these "small block" operations into the fused operator. Each "small block" operation corresponds to an original independent operator but operates on only a part (tile) of the original tensor, so it is called a Tile Tensor Operator. The CK library contains a set of highly optimized Tile Tensor Operator implementations, and all independent and fused AI operators in CK are built from them. Currently, these Tile Tensor Operators include Tile GEMM, Tile Reduction and Tile Tensor Transfer. Each Tile Tensor Operator has implementations for GPU thread blocks, warps and threads.
Tensor Coordinate Transformation and Tile Tensor Operators together form the reusable basic modules of CK.
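To make the tile-based paradigm more concrete, here is a minimal sketch in plain Python, not CK code. It fuses a GEMM with a row-wise max reduction out of three tile-level helpers that mirror the Tile Tensor Transfer, Tile GEMM and Tile Reduction operators named above. The function names, the tile sizes and the choice of "GEMM followed by row max" as the fused operator are all assumptions made for illustration.

```python
def tile_load(src, r0, c0, rows, cols):
    """Tile tensor transfer (hypothetical helper): copy a rows x cols tile."""
    return [src[r0 + i][c0:c0 + cols] for i in range(rows)]


def tile_gemm(acc, a_tile, b_tile):
    """Tile GEMM (hypothetical helper): acc += a_tile @ b_tile."""
    for i, a_row in enumerate(a_tile):
        for k, a_val in enumerate(a_row):
            for j, b_val in enumerate(b_tile[k]):
                acc[i][j] += a_val * b_val


def tile_reduce_max(row_max, acc):
    """Tile reduction (hypothetical helper): fold a tile into running row maxima."""
    for i, row in enumerate(acc):
        row_max[i] = max(row_max[i], max(row))


def fused_gemm_rowmax(a, b, tm=2, tn=2, tk=2):
    """Row-wise max of A @ B, computed tile by tile.

    The full product matrix is never stored: each output tile lives only in
    the local accumulator `acc` and is consumed by the reduction right away.
    """
    m_len, k_len, n_len = len(a), len(b), len(b[0])
    assert m_len % tm == 0 and n_len % tn == 0 and k_len % tk == 0
    result = []
    for m0 in range(0, m_len, tm):
        row_max = [float("-inf")] * tm
        for n0 in range(0, n_len, tn):
            acc = [[0] * tn for _ in range(tm)]
            for k0 in range(0, k_len, tk):
                a_tile = tile_load(a, m0, k0, tm, tk)
                b_tile = tile_load(b, k0, n0, tk, tn)
                tile_gemm(acc, a_tile, b_tile)
            tile_reduce_max(row_max, acc)
        result.extend(row_max)
    return result


def unfused_gemm_rowmax(a, b):
    """Reference: materialize the whole product, then reduce it."""
    c = [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]
    return [max(row) for row in c]
```

The unfused version writes the whole intermediate matrix before reducing it, while the fused version only ever holds one tile of it. On a GPU that intermediate would otherwise travel through slow memory, which is the saving operator fusion is after.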
Figure 1, using CK’s Tensor Coordinate Transformation basic modules to express the convolution operator as a GEMM operator
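The idea of expressing a convolution as a GEMM through coordinate transformations can be sketched in plain Python as follows. This is only an illustration, not CK's API: the helper `split_index`, the tensor layouts, and the restriction to a single image with stride 1 and no padding are assumptions chosen to keep the example small. The point is that the GEMM loop never sees the convolution; it only asks for elements `a(m, k)` and `b(k, n)`, and the index mapping does the rest.

```python
def split_index(index, lengths):
    """Turn a flat index into a row-major multi-index with the given lengths."""
    coords = []
    for length in reversed(lengths):
        coords.append(index % length)
        index //= length
    return tuple(reversed(coords))


def conv2d_direct(inp, wei):
    """Reference convolution: inp[c][h][w], wei[k][c][r][s], stride 1, no padding."""
    C, H, W = len(inp), len(inp[0]), len(inp[0][0])
    K, R, S = len(wei), len(wei[0][0]), len(wei[0][0][0])
    Ho, Wo = H - R + 1, W - S + 1
    out = [[[0] * Wo for _ in range(Ho)] for _ in range(K)]
    for k in range(K):
        for ho in range(Ho):
            for wo in range(Wo):
                for c in range(C):
                    for r in range(R):
                        for s in range(S):
                            out[k][ho][wo] += wei[k][c][r][s] * inp[c][ho + r][wo + s]
    return out


def conv2d_as_gemm(inp, wei):
    """Same convolution, written as a plain GEMM over transformed coordinates."""
    C, H, W = len(inp), len(inp[0]), len(inp[0][0])
    K, R, S = len(wei), len(wei[0][0]), len(wei[0][0][0])
    Ho, Wo = H - R + 1, W - S + 1
    gemm_m, gemm_n, gemm_k = K, Ho * Wo, C * R * S

    def a(m, k):
        # GEMM index k is split back into (c, r, s) of the weight tensor.
        c, r, s = split_index(k, (C, R, S))
        return wei[m][c][r][s]

    def b(k, n):
        # GEMM index n is split into (ho, wo); the input pixel is (ho + r, wo + s).
        c, r, s = split_index(k, (C, R, S))
        ho, wo = split_index(n, (Ho, Wo))
        return inp[c][ho + r][wo + s]

    # An ordinary GEMM that knows nothing about convolution.
    gemm_out = [[sum(a(m, k) * b(k, n) for k in range(gemm_k))
                 for n in range(gemm_n)] for m in range(gemm_m)]
    # View the (K, Ho*Wo) result as a (K, Ho, Wo) tensor.
    return [[[gemm_out[k][ho * Wo + wo] for wo in range(Wo)]
             for ho in range(Ho)] for k in range(K)]
```

No im2col buffer is built here: the second GEMM operand exists only as an index mapping onto the original input tensor. Under the same assumptions, any GEMM algorithm could be substituted for the inner loop without touching the mapping functions.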
Figure 2, the composition of CK (bottom: reusable basic modules; top: independent operators and fusion operators)
The CK library is structured in four layers, from bottom to top: Templated Tile Operator, Templated Kernel and Invoker, Instantiated Kernel and Invoker, and Client API [3]. Each layer targets a different group of developers.
Figure 3, CK library four-layer structure
End-to-end model inference based on AITemplate + CK
Meta’s AITemplate [7] (AIT) is an AI inference system that unifies AMD and Nvidia GPUs. On AMD GPUs, AITemplate uses CK as its back-end, specifically CK's Templated Kernel and Invoker layer.
AITemplate + CK achieves state-of-the-art inference performance on multiple important AI models on the AMD Instinct™ MI250. The definitions of most of the advanced fusion operators in CK are driven by the vision of the AITemplate team, and many fusion operator algorithms are jointly designed by the CK and AITemplate teams.
This article compares the performance of several end-to-end models on AMD Instinct MI250 and on comparable products [8]. All AI model performance data for the AMD Instinct MI250 in this article were obtained using AITemplate [9] + CK [10].
Experiments
ResNet-50
The figure below compares the performance of AIT + CK on AMD Instinct MI250 with TensorRT v8.5.0.12 [11] (TRT) on A100-PCIe-40GB and A100-DGX-80GB. The results show that AIT + CK on AMD Instinct MI250 achieves a 1.08x speedup over TRT on A100-PCIe-40GB.
BERT
A Batched GEMM + Softmax + GEMM fusion operator template implemented with CK can completely eliminate the transfer of intermediate results between the GPU Compute Units and HBM. With this fusion operator template, many problems in the attention layer that were originally bandwidth bound become compute bound, which makes better use of the GPU's computing power. This CK implementation is deeply inspired by FlashAttention [12] and reduces data movement further than the original FlashAttention implementation.
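The core trick that lets GEMM, Softmax and GEMM be fused without storing the intermediate score matrix is the "online softmax" used by FlashAttention [12]. The sketch below shows it in plain Python for a single attention head. It is an illustration of the general technique, not CK's implementation: the function names and the tile size are assumptions, the usual scaling of the scores is omitted, and it does not reproduce whatever additional data-movement savings CK makes.

```python
import math


def attention_reference(q, k, v):
    """Unfused: build every score row in full, apply softmax, multiply by V."""
    out = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        top = max(scores)
        probs = [math.exp(s - top) for s in scores]
        total = sum(probs)
        out.append([sum(p * v_row[d] for p, v_row in zip(probs, v)) / total
                    for d in range(len(v[0]))])
    return out


def attention_fused(q, k, v, tile=2):
    """Fused: walk K and V in tiles, keeping only a running max, sum and output."""
    head_dim = len(v[0])
    out = []
    for q_row in q:
        run_max = float("-inf")
        run_sum = 0.0
        acc = [0.0] * head_dim
        for start in range(0, len(k), tile):
            k_tile = k[start:start + tile]
            v_tile = v[start:start + tile]
            # First GEMM, restricted to this tile of keys.
            scores = [sum(a * b for a, b in zip(q_row, k_row)) for k_row in k_tile]
            new_max = max(run_max, max(scores))
            # Rescale earlier partial results to the new maximum.
            # On the first tile this factor is exp(-inf) == 0.0.
            scale = math.exp(run_max - new_max)
            probs = [math.exp(s - new_max) for s in scores]
            run_sum = run_sum * scale + sum(probs)
            # Second GEMM, accumulated into the running output row.
            acc = [acc[d] * scale + sum(p * v_row[d] for p, v_row in zip(probs, v_tile))
                   for d in range(head_dim)]
            run_max = new_max
        out.append([x / run_sum for x in acc])
    return out
```

The fused version only ever holds one tile of scores at a time, so the score matrix, whose size grows with the square of the sequence length, never has to be written to or read back from HBM. Both functions return the same values up to floating-point rounding.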
The following figure compares the performance of the BERT Base (uncased) model with AIT + CK on AMD Instinct MI250 and with FasterTransformer v5.1.1 bug fix [13] (FT) on A100-PCIe-40GB and A100-DGX-80GB. FT runs out of GPU memory at Batch 32 on both A100-PCIe-40GB and A100-DGX-80GB when Sequence is 4096, so for Sequence 4096 this article only shows the results for Batch 16. The results show that AIT + CK on AMD Instinct MI250 achieves a 3.28x speedup over FT on A100-PCIe-40GB and a 2.91x speedup over FT on A100-DGX-80GB.
Vision Transformer (ViT)
The figure below compares the performance of Vision Transformer Base (224x224 image) with AIT + CK on AMD Instinct MI250 and with TensorRT v8.5.0.12 (TRT) on A100-PCIe-40GB and A100-DGX-80GB. The results show that AIT + CK on AMD Instinct MI250 achieves a 1.8x speedup over TRT on A100-PCIe-40GB and a 1.4x speedup over TRT on A100-DGX-80GB.
Stable Diffusion
End-to-end Stable Diffusion
The following table shows end-to-end Stable Diffusion performance data for AIT + CK (Batch 1, 2, 4, 6) on AMD Instinct MI250. At Batch 1, only one GCD of the MI250 is used, while at Batch 2, 4, and 6 both GCDs are used.
UNet in Stable Diffusion
However, the authors of this article have not found public information on running the Stable Diffusion end-to-end model with TensorRT. The article "Making stable diffusion 25% faster using TensorRT" [14] does explain how to use TensorRT to accelerate the UNet model in Stable Diffusion. UNet is the most important and most time-consuming part of Stable Diffusion, so UNet performance roughly reflects Stable Diffusion performance.
The figure below compares UNet performance with AIT + CK on AMD Instinct MI250 and with TensorRT v8.5.0.12 (TRT) on A100-PCIe-40GB and A100-DGX-80GB. The results show that AIT + CK on AMD Instinct MI250 achieves a 2.45x speedup over TRT on A100-PCIe-40GB and a 2.03x speedup over TRT on A100-DGX-80GB.
More information
ROCm webpage: AMD ROCm™ Open Software Platform | AMD
ROCm Information Portal: AMD Documentation - Portal
AMD Instinct Accelerators: AMD Instinct™ Accelerators | AMD
AMD Infinity Hub: AMD Infinity Hub | AMD
Endnotes:
1. Chao Liu is PMTS Software Development Engineer at AMD. Jing Zhang is SMTS Software Development Engineer at AMD. Their postings are their own opinions and may not represent AMD's positions, strategies, or opinions. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied. GD-5
2. CK for CPU is in early development phase.
3. C APIs for now, Python APIs are under planning.
4. Example of CK “Client API” for GEMM Add Add FastGeLU fused operator. https://github.com/ROCmSoftwarePlatform/composable_kernel/blob/685860c2a9483c9e909d2f8bfb95056672491...
5. Example of CK “Templated Kernel and Invoker” of GEMM Add Add FastGeLU fused operator. https://github.com/ROCmSoftwarePlatform/composable_kernel/blob/685860c2a9483c9e909d2f8bfb95056672491...
6. Example of using CK “Templated Tile Operator” primitives to write a GEMM pipeline. https://github.com/ROCmSoftwarePlatform/composable_kernel/blob/685860c2a9483c9e909d2f8bfb95056672491...
7. Meta’s AITemplate GitHub repository. https://github.com/facebookincubator/AITemplate
8. MI200-71: Testing conducted by AMD MLSE 10.23.22 using AITemplate (https://github.com/ROCmSoftwarePlatform/AITemplate, commit f940d9b) and Composable Kernel (https://github.com/ROCmSoftwarePlatform/composable_kernel, commit 40942b9) with ROCm™ 5.3 running on 2x AMD EPYC 7713 64-Core Processor server with 4x AMD Instinct MI250 OAM (128 GB HBM2e) 560W GPU with AMD Infinity Fabric™ technology vs. TensorRT v8.5.0.12 and FasterTransformer (v5.1.1 bug fix) with CUDA® 11.8 running on 2x AMD EPYC 7742 64-Core Processor server with 4x Nvidia A100-PCIe-40GB (250W) GPU and TensorRT v8.5.0.12 and FasterTransformer (v5.1.1 bug fix) with CUDA® 11.8 running on 2x AMD EPYC 7742 64-Core Processor server with 8x NVIDIA A100 SXM 80GB (400W) GPU. Server manufacturers may vary configurations, yielding different results. Performance may vary based on factors including use of latest drivers and optimizations.
9. https://github.com/ROCmSoftwarePlatform/AITemplate/tree/f940d9b7ac8b976fba127e2c269dc5b368f30e4e
10. https://github.com/ROCmSoftwarePlatform/composable_kernel/tree/40942b909801dd721769834fc61ad201b5795...
11. TensorRT GitHub repository. https://github.com/NVIDIA/TensorRT
12. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. https://arxiv.org/abs/2205.14135
13. FasterTransformer GitHub repository. https://github.com/NVIDIA/FasterTransformer
14. Making stable diffusion 25% faster using TensorRT. https://www.photoroom.com/tech/stable-diffusion-25-percent-faster-and-save-seconds/
15. During their time in AMD