
Technical details of ByteDance's Wanka (10,000-GPU) cluster disclosed: GPT-3-scale training completed in under 2 days, with compute utilization surpassing NVIDIA Megatron-LM

WBOY
Release: 2024-03-01 16:01:33

As the technical analysis of Sora unfolds, the importance of AI infrastructure becomes increasingly prominent.

Against this backdrop, a new paper from ByteDance and Peking University has attracted attention:

The paper discloses that the Wanka (10,000-card) cluster built by ByteDance can complete the training of a GPT-3-scale model (175B) in 1.75 days.


Specifically, ByteDance proposes a production system called MegaScale, which aims to address the efficiency and stability challenges of training large models on the Wanka cluster.

When training a 175-billion-parameter large language model on 12,288 GPUs, MegaScale achieved a Model FLOPs Utilization (MFU) of 55.2%, 1.34 times that of NVIDIA Megatron-LM.

The paper also reveals that, as of September 2023, ByteDance had built an Ampere-architecture GPU (A100/A800) cluster with more than 10,000 cards, and is currently building a large-scale Hopper-architecture (H100/H800) cluster.

A production system built for the Wanka cluster

In the era of large models, the importance of GPUs needs no elaboration.

But training a large model is not simply a matter of piling up cards: once a GPU cluster reaches the 10,000-card scale, achieving efficient and stable training is itself a challenging engineering problem.

The first challenge: efficiency.

Training a large language model is not a simple parallel task. The model must be partitioned across many GPUs, and those GPUs must communicate frequently to advance training together. Besides communication, factors such as operator optimization, data preprocessing, and GPU memory consumption all affect Model FLOPs Utilization (MFU), the indicator used to measure training efficiency.

MFU is the ratio of actual throughput to theoretical maximum throughput.
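
As a concrete (and simplified) illustration, the sketch below computes MFU from token throughput. The throughput figure and the 312 TFLOPS BF16 peak of an A100-class GPU are placeholder assumptions, and the 6N FLOPs-per-token rule of thumb is a common approximation rather than the paper's exact accounting.

```python
# Minimal MFU sketch; illustrative numbers only, not figures from the paper.

def model_flops_per_token(n_params: float) -> float:
    """Common ~6*N approximation of training FLOPs per token for a dense Transformer."""
    return 6.0 * n_params

def mfu(tokens_per_sec: float, n_params: float,
        n_gpus: int, peak_flops_per_gpu: float) -> float:
    """MFU = achieved model FLOPs per second / theoretical peak FLOPs of the cluster."""
    achieved = tokens_per_sec * model_flops_per_token(n_params)
    return achieved / (n_gpus * peak_flops_per_gpu)

# Hypothetical 2M tokens/s for a 175B model on 12,288 GPUs at 312 TFLOPS each.
print(f"MFU ~ {mfu(2.0e6, 175e9, 12288, 312e12):.1%}")
```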

The second challenge: stability.

We know that training large language models often takes a very long time, which also means that failures and delays during the training process are not uncommon.

Failures are costly, so shortening fault-recovery time is particularly important.

To address these challenges, ByteDance researchers built MegaScale and have deployed it in ByteDance's data centers to support the training of various large models.

MegaScale is built on top of NVIDIA Megatron-LM.

Specific improvements include the co-design of algorithms and system components, overlapping communication with computation, operator optimization, data-pipeline optimization, and network performance tuning:

  • Algorithm optimization: The researchers introduced a parallelized Transformer block, sliding-window attention (SWA), and the LAMB optimizer into the model architecture to improve training efficiency without sacrificing model convergence.
  • Communication overlap: Based on a detailed analysis of each computing unit's operations under 3D parallelism (data, pipeline, and tensor parallelism), the researchers designed strategies to hide the latency of operations off the critical execution path and shorten per-iteration training time.
  • Efficient operators: GEMM operators were optimized, and operations such as LayerNorm and GeLU were fused to reduce the overhead of launching multiple kernels and to improve memory-access patterns (a toy fusion sketch follows this list).
  • Data pipeline optimization: Data preprocessing and loading were optimized to cut GPU idle time, through asynchronous data preprocessing and the elimination of redundant data loaders (see the prefetcher sketch after this list).
  • Collective communication group initialization: The initialization of NVIDIA's multi-GPU communication library NCCL in distributed training was optimized. Without optimization, initializing a 2,048-GPU cluster takes 1,047 seconds; after optimization it drops to under 5 seconds, and initializing a 10,000-GPU cluster drops to under 30 seconds.
  • Network performance tuning: Inter-machine traffic under 3D parallelism was analyzed, and solutions were designed to improve network performance, including network topology design, reduction of ECMP hash conflicts, congestion control, and retransmission-timeout tuning.
  • Fault tolerance: In a 10,000-GPU cluster, software and hardware failures are unavoidable. The researchers designed a training framework for automatic fault identification and rapid recovery, including diagnostic tools that monitor system components and events, and optimized high-frequency checkpointing of the training process.
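
As a rough illustration of the operator-fusion idea above, here is a toy sketch that uses torch.compile (available in PyTorch 2.x) as a stand-in for the hand-written fused kernels the paper describes; the shapes and the block itself are made up for demonstration.

```python
import torch

# Toy illustration of operator fusion, not MegaScale's actual kernels:
# torch.compile can fuse elementwise work (LayerNorm's normalization math,
# GeLU) into fewer kernel launches than eager execution.

norm = torch.nn.LayerNorm(1024)

def block(x):
    # Eager mode launches LayerNorm and GeLU as separate operations.
    return torch.nn.functional.gelu(norm(x))

fused_block = torch.compile(block)  # compiled/fused version of the same computation

x = torch.randn(16, 1024)
# The two versions should agree numerically (up to small floating-point error).
print(torch.max(torch.abs(block(x) - fused_block(x))).item())
```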
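
And here is a minimal sketch of the asynchronous data-preprocessing idea from the data-pipeline item, assuming a generic iterable of batches and a CPU-side preprocess function; it illustrates the overlap pattern only, not MegaScale's actual pipeline.

```python
import queue
import threading

class AsyncPrefetcher:
    """Prepare upcoming batches on a background thread so the training loop
    (and hence the GPU) does not stall on CPU-side preprocessing."""

    def __init__(self, batch_iter, preprocess, depth=4):
        self._queue = queue.Queue(maxsize=depth)   # bounded buffer of ready batches
        self._done = object()                      # sentinel marking end of data

        def _worker():
            for batch in batch_iter:
                self._queue.put(preprocess(batch))  # CPU work runs off the critical path
            self._queue.put(self._done)

        threading.Thread(target=_worker, daemon=True).start()

    def __iter__(self):
        while True:
            item = self._queue.get()
            if item is self._done:
                return
            yield item

# Usage with dummy data; the (hypothetical) train_step is where GPU work would go.
for batch in AsyncPrefetcher(range(8), preprocess=lambda b: [b] * 4):
    pass  # train_step(batch)
```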

The paper notes that MegaScale can automatically detect and recover from more than 90% of software and hardware failures; a simplified illustration of the underlying checkpoint-and-resume pattern follows.
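
The sketch below uses a hypothetical checkpoint path, save interval, and stand-in model; it only illustrates the pattern of saving often and resuming from the latest checkpoint, not MegaScale's actual fault-tolerant framework.

```python
import os
import torch

CKPT_PATH = "checkpoint.pt"   # hypothetical path
SAVE_EVERY = 100              # steps; frequent saves bound how much work a failure can destroy

def save_checkpoint(step, model, optimizer):
    torch.save({"step": step,
                "model": model.state_dict(),
                "optim": optimizer.state_dict()}, CKPT_PATH)

def load_checkpoint(model, optimizer):
    if not os.path.exists(CKPT_PATH):
        return 0
    state = torch.load(CKPT_PATH)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optim"])
    return state["step"] + 1  # resume from the step after the last saved one

model = torch.nn.Linear(8, 8)                          # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

start = load_checkpoint(model, optimizer)              # after a failure, training resumes here
for step in range(start, 1000):
    # train_step(model, optimizer)                     # actual training work would go here
    if step % SAVE_EVERY == 0:
        save_checkpoint(step, model, optimizer)
```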


Experimental results show that MegaScale achieved 55.2% MFU when training a 175B large language model on 12,288 GPUs, 1.34 times the MFU of Megatron-LM.

The MFU comparison results of training a 530B large language model are as follows:

[Figure: MFU comparison between MegaScale and Megatron-LM for 530B model training]

One More Thing

Just as this technical paper was sparking discussion, news also emerged about ByteDance's Sora-like product:

Jianying's Sora-like AI video tool has launched an invitation-only beta test.


It seems the foundation has been laid. Are you looking forward to ByteDance's large-model products?

Paper address: https://arxiv.org/abs/2402.15627


Source: 51cto.com