On Layer Normalization in the Transformer Architecture

Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing et al.

2020 International Conference on Machine Learning. Cited 1,347 times.

Abstract

The Transformer is widely used in natural language processing tasks. To train a Transformer, however, one usually needs a carefully designed learning rate warm-up stage, which is shown to be crucial to the final performance but slows down the optimization and requires more hyperparameter tuning. In this paper, we first study theoretically why the learning rate warm-up stage is essential and show that the location of layer normalization matters. Specifically, we prove with mean field theory that at initialization, for the originally designed Post-LN Transformer, which places the layer normalization between the residual blocks, the expected gradients of the parameters near the output layer are large. Therefore, using a large learning rate on those gradients makes the training unstable. The warm-up stage is practically helpful for avoiding this problem. On the other hand, our theory also shows that if the layer normalization is put inside the residual blocks (recently proposed as the Pre-LN Transformer), the gradients are well-behaved at initialization. This motivates us to remove the warm-up stage for the training of Pre-LN Transformers. We show in our experiments that Pre-LN Transformers without the warm-up stage can reach results comparable with baselines while requiring significantly less training time and hyperparameter tuning on a wide range of applications.
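The two layer normalization placements contrasted in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the helper names (`layer_norm`, `post_ln_block`, `pre_ln_block`), the single linear map standing in for an attention or feed-forward sub-layer, and the omission of the learnable gain and bias are all assumptions made for illustration.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each row to zero mean and unit variance over the feature axis.
    # The learnable gain and bias are omitted in this sketch.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    # Post-LN: normalize after the residual addition, between residual blocks.
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    # Pre-LN: normalize inside the residual branch; the skip path is untouched.
    return x + sublayer(layer_norm(x))

# Hypothetical stand-in for an attention or feed-forward sub-layer.
rng = np.random.default_rng(0)
d = 8
W = rng.normal(scale=d ** -0.5, size=(d, d))
sublayer = lambda h: h @ W

x = rng.normal(size=(4, d))          # 4 token vectors of width d
y_post = post_ln_block(x, sublayer)  # output is re-normalized
y_pre = pre_ln_block(x, sublayer)    # output keeps an identity path from x
```

In the Pre-LN form the input `x` reaches the output through an unnormalized identity path, whereas in the Post-LN form every block output passes through a normalization. This is the structural difference behind the abstract's claim about gradient behavior at initialization.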

Cited in this thesis

Fish Species and Part Identification

Frequently Cited Together

Generalization and parameter estimation in feedforward nets: Some experiments (Morgan 1989, 1 chapter)
Bert: Pre-training of deep bidirectional transformers for language understanding (Devlin 2018, 1 chapter)
Idiot's Bayes—not so stupid after all? (Hand 2001, 1 chapter)
Adaptive mixtures of local experts (Jacobs 1991, 1 chapter)
Gaussian error linear units (gelus) (Hendrycks 2016, 1 chapter)
Identification of biological tissues by rapid evaporative ionization mass spectr (Balog 2010, 1 chapter)

BibTeX

@inproceedings{Xiong2020,
  title = {On layer normalization in the transformer architecture},
  author = {Xiong, Ruibin and Yang, Yunchang and He, Di and Zheng, Kai and Zheng, Shuxin and Xing, Chen and Zhang, Huishuai and Lan, Yanyan and Wang, Liwei and Liu, Tieyan},
  booktitle = {International conference on machine learning},
  pages = {10524--10533},
  year = {2020},
  organization = {PMLR},
}