Microsoft Research Releases the First FP4 Precision Large Model Training Framework

Significantly Improving Efficiency
On January 29, Microsoft Research released the first FP4-precision large model training framework. With the same hyperparameter settings, the framework achieves training results comparable to those of FP8 and BF16 while requiring less storage and computation.
Model Size and Training Performance
Models trained with this method scale up to 13 billion parameters, with training token counts reaching the hundred-billion level. FP4 is currently simulated via FP8; on native FP4 hardware, the efficiency gains would be even larger.
FP4 Simulation and Training Method
Since native FP4 hardware was unavailable at the time of the research, the team simulated FP4 using FP8 on Tensor Cores. For LLaMA models at 1.3B, 7B, and 13B parameters, the loss curves under FP4 and BF16 remain nearly identical over the course of training.
To enable FP4-precision training, the team wrote a custom FP4 GeMM CUDA kernel: the FP4 A and B matrices are first loaded into shared memory in FP16 form and transformed, block matrix multiplication is then performed in FP4, and the intermediate results are reduced to produce the final output matrix in FP16.
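The kernel itself is not reproduced in the article, so the following is only a minimal Python/NumPy sketch of the numerics, not the CUDA implementation: operands restricted to FP4-representable values are held in 16-bit containers, multiplied block by block with higher-precision accumulation, and reduced to an FP16 output. The tile size and random operands are illustrative assumptions.

```python
import numpy as np

# E2M1-representable magnitudes; operands are assumed to hold only these values,
# stored in a 16-bit container because no native FP4 dtype exists in NumPy.
grid = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float16)
grid = np.concatenate([-grid[:0:-1], grid])           # add the negative half

rng = np.random.default_rng(0)
A = rng.choice(grid, size=(128, 64))                  # FP4-valued activation tile
B = rng.choice(grid, size=(64, 256))                  # FP4-valued weight tile

# Blocked matrix multiplication with FP32 accumulation, reduced to an FP16 output,
# mirroring the load -> transform -> block-multiply -> reduce flow described above.
TILE = 16
C = np.zeros((A.shape[0], B.shape[1]), dtype=np.float32)
for k in range(0, A.shape[1], TILE):
    C += A[:, k:k + TILE].astype(np.float32) @ B[k:k + TILE, :].astype(np.float32)
C = C.astype(np.float16)                              # final FP16 output matrix
print(C.shape, C.dtype)                               # (128, 256) float16
```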
Quantization Strategy and Hardware Compatibility
The framework uses the E2M1 FP4 format, in which 2 bits encode the exponent, 1 bit the mantissa, and 1 bit the sign, for 4 bits in total. This format was chosen to match the design of current mainstream ML acceleration chips. Different quantization strategies are applied to the weight matrices (W) and the activation matrices (A) to maximize the speedup FP4 delivers in matrix multiplication.
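For concreteness, the sixteen E2M1 code points can be decoded directly. Assuming an exponent bias of 1 and subnormal support (a common FP4 convention, not something stated in the article), the representable magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4, and 6; the decoding helper below is a minimal sketch, not code from the framework.

```python
import itertools

def decode_e2m1(sign: int, exp: int, man: int, bias: int = 1) -> float:
    """Decode one E2M1 code point: 1 sign bit, 2 exponent bits, 1 mantissa bit."""
    s = -1.0 if sign else 1.0
    if exp == 0:                                   # subnormal: no implicit leading 1
        return s * (man / 2) * 2.0 ** (1 - bias)
    return s * (1 + man / 2) * 2.0 ** (exp - bias)

values = sorted({decode_e2m1(s, e, m)
                 for s, e, m in itertools.product(range(2), range(4), range(2))})
print(values)
# [-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```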
Forward and Backward Propagation Innovations
During forward propagation, the framework quantizes both the weight matrix W and the activation matrix A of each linear layer. During quantization, values are scaled and shifted to fit the FP4 range and then rounded to the nearest representable FP4 value using a lookup table.
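As a rough illustration of that step, the sketch below scales a tensor by its maximum magnitude so it fits within the largest E2M1 value (6) and snaps each entry to the nearest representable FP4 value by table lookup. Per-tensor absmax scaling is a simplification of the separate weight and activation strategies mentioned above, and the function names are our own.

```python
import numpy as np

E2M1_POS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
E2M1_GRID = np.concatenate([-E2M1_POS[:0:-1], E2M1_POS])    # signed FP4 value table
FP4_MAX = E2M1_POS[-1]

def quantize_fp4(x: np.ndarray):
    """Scale x into the E2M1 range, then round each entry to the nearest FP4 value
    by table lookup. Per-tensor absmax scaling is used here for simplicity."""
    scale = np.abs(x).max() / FP4_MAX
    if scale == 0:                                           # avoid dividing by zero
        scale = 1.0
    diffs = np.abs(x[..., None] / scale - E2M1_GRID)         # distance to each entry
    q = E2M1_GRID[diffs.argmin(axis=-1)]                     # nearest-value lookup
    return q, scale

def dequantize_fp4(q: np.ndarray, scale: float) -> np.ndarray:
    return q * scale

x = np.random.randn(4, 8)
q, s = quantize_fp4(x)
print(np.abs(dequantize_fp4(q, s) - x).max())                # quantization error
```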
In backward propagation, a new differentiable gradient estimation method is proposed to preserve both computational efficiency and gradient accuracy. In addition, an outlier clamping and compensation strategy is introduced to handle outliers in the activation matrices.
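The article does not give either technique's exact formulation, so the sketch below only illustrates the two ideas under explicit assumptions: the gradient of the hard rounding step is taken from a smooth staircase surrogate (a sinusoidal one chosen purely for illustration), and activation outliers are clamped at a high quantile, with the clipped-off residual kept separately so it can be compensated for in higher precision.

```python
import numpy as np

def surrogate_round_grad(u: np.ndarray) -> np.ndarray:
    """Derivative of the smooth staircase r(u) = u - sin(2*pi*u) / (2*pi), a
    differentiable surrogate for round-to-nearest on a unit grid. Multiplying the
    upstream gradient by this (instead of by 1, as a straight-through estimator
    would) is the idea behind differentiable gradient estimation; the sinusoidal
    surrogate is an illustrative stand-in, not the paper's formula."""
    return 1.0 - np.cos(2.0 * np.pi * u)

def clamp_and_compensate(a: np.ndarray, quantile: float = 0.999):
    """Clamp activation outliers beyond a high quantile of |a| and return the
    clipped residual, which is sparse and can be handled in higher precision and
    added back after the low-precision matmul. The quantile and the compensation
    path are assumptions made for illustration."""
    threshold = np.quantile(np.abs(a), quantile)
    clamped = np.clip(a, -threshold, threshold)
    residual = a - clamped              # non-zero only at the rare outlier positions
    return clamped, residual

a = np.random.standard_cauchy((64, 128))            # heavy-tailed activations
clamped, residual = clamp_and_compensate(a)
print(np.count_nonzero(residual) / residual.size)   # fraction of entries clamped
```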
Mixed Precision Design and Application
The team adopted a mixed-precision design for the rest of the pipeline: FP8 for gradient communication, FP16 for optimizer-state storage, and FP16 for other non-matrix-multiplication operations such as loss scaling. These choices reduce computational and storage overhead while preserving numerical stability.
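Written out as a simple mapping for reference (labels only, following the assignments described above; the dictionary is our shorthand, not configuration from the framework):

```python
# Precision assignment described above, written out as plain labels.
PRECISION_PLAN = {
    "gemm_weights_and_activations": "FP4 (E2M1, simulated on FP8 tensor cores)",
    "gradient_communication":       "FP8",
    "optimizer_states":             "FP16",
    "non_matmul_ops":               "FP16 (e.g. loss scaling)",
}
```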
Research Team and Background
The framework was developed by Microsoft Research Asia (MSRA) and the SIGMA team, and all of the researchers are Chinese. The first author, Ruizhe Wang, is a doctoral student at the University of Science and Technology of China (USTC) and is currently an intern at MSRA, focusing on low-precision quantization. Professor Zhengjun Zha, director of USTC's Research Department, also contributed to the project. The corresponding authors are Peng Cheng, Senior Research Manager at MSRA, and Yeyun Gong, Principal Research Manager at MSRA. MSRA Distinguished Scientist Baining Guo also took part in the project.
Summary
The FP4-precision large model training framework released by Microsoft Research introduces innovative quantization strategies and a mixed-precision design that improve training efficiency while significantly reducing the computational and storage resources required, opening a new path for large model training.
Paper Link: arXiv
Reference Link: Twitter