Length Generalization of Causal Transformers without Position Encoding

Jie Wang, Tao Ji, Yuanbin Wu, Hang Yan, Tao Gui, Qi Zhang et al.

2024 Annual Meeting of the Association for Computational Linguistics. Cited 50 times.

Abstract

Generalizing to longer sentences is important for recent Transformer-based language models. Besides algorithms manipulating explicit position features, the success of Transformers without position encodings (NoPE) provides a new way to overcome the challenge. In this paper, we study the length generalization property of NoPE. We find that although NoPE can extend to longer sequences than the commonly used explicit position encodings, it still has a limited context length. We identify a connection between the failure of NoPE's generalization and the distraction of attention distributions. We propose a parameter-efficient tuning for searching attention heads' best temperature hyper-parameters, which substantially expands NoPE's context size. Experiments on long sequence language modeling, the synthetic passkey retrieval task and real-world long context tasks show that NoPE can achieve competitive performances with state-of-the-art length generalization algorithms. The source code is publicly accessible.
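
The temperature idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the toy dimensions, the fixed scale values, and the use of entropy as a stand-in for how "distracted" an attention distribution is are all assumptions made here for illustration. The code computes causal attention weights with no position encoding (ordering comes only from the causal mask), then applies a temperature scale to the logits and compares how spread out the last token's attention is.

```python
import math
import random


def softmax(scores, scale=1.0):
    """Softmax over a list of scores after multiplying each score by `scale`."""
    scaled = [s * scale for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def causal_attention_weights(queries, keys, scale=1.0):
    """Attention weights for one head with a causal mask and no position encoding.

    Position i attends only to positions 0..i. `scale` is a hypothetical
    per-head temperature factor applied to the usual q.k / sqrt(d) logits.
    """
    d = len(queries[0])
    rows = []
    for i, q in enumerate(queries):
        logits = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        rows.append(softmax(logits, scale))
    return rows


# Toy data: 16 tokens with 8-dimensional queries and keys (assumed sizes).
rng = random.Random(0)
n_tokens, dim = 16, 8
queries = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_tokens)]
keys = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_tokens)]

base = causal_attention_weights(queries, keys, scale=1.0)
sharp = causal_attention_weights(queries, keys, scale=2.0)

print("entropy of last row, scale 1.0:", entropy(base[-1]))
print("entropy of last row, scale 2.0:", entropy(sharp[-1]))
```

In this sketch a larger scale concentrates each row of attention on fewer tokens. The abstract describes searching for the best temperature per attention head through parameter-efficient tuning; here the scale is simply set by hand.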

Cited in this thesis

Literature Survey
Fish Species and Part Identification

Frequently Cited Together

Adaptive mixtures of local experts (Jacobs 1991), 2 chapters
Deep Learning (Goodfellow 2016), 2 chapters
Mamba: Linear-time sequence modeling with selective state spaces (Gu 2023), 2 chapters
Towards Real-Time Industry-Proof Pork Breed and Boar Taint Classification Using (Gkarane 2025), 2 chapters
Grad-cam: Visual explanations from deep networks via gradient-based localization (Selvaraju 2017), 2 chapters
Deep residual learning for image recognition (He 2016), 2 chapters

BibTeX

@article{Wang2024,
  title = {Length generalization of causal transformers without position encoding},
  author = {Wang, Jie and Ji, Tao and Wu, Yuanbin and Yan, Hang and Gui, Tao and Zhang, Qi and Huang, Xuanjing and Wang, Xiaoling},
  journal = {arXiv preprint arXiv:2404.12224},
  year = {2024},
}