Enhancing Language Models with LM-Infinite for Better Length Generalization
In a recent paper titled “LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models”, the authors address a key limitation of large language models (LLMs): their inability to handle long text sequences effectively. Despite impressive advances in natural language generation, LLMs still struggle to generalize to sequence lengths beyond those seen during training. The paper proposes a simple and efficient solution, LM-Infinite, that enables on-the-fly length generalization without retraining.
The Challenge with Long Text Sequences
Most LLM training schemes cap sequence length at a fixed size to control costs. When a model encounters longer contexts at inference time, however, generation degrades into incoherent text. Even relative position encodings, which were designed to mitigate this issue, fail to generalize to longer lengths in practice.
The paper identifies three key factors that limit length generalization:
1. Unseen Long Distances: On very long sequences, some distances far exceed those seen during training, leading to exploding attention logits as the model tries to distinguish new distances.
2. Unseen Number of Tokens: Longer contexts increase the number of tokens attended to, which dilutes attention and raises its entropy, resulting in information loss (see the toy example after this list).
3. Implicitly Encoded Absolute Position: Early transformer layers appear to implicitly encode some absolute position information. As sequence length grows, this encoding for the initial tokens becomes distorted.
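To make the attention-dilution point concrete, here is a toy illustration (my own, not taken from the paper): with roughly comparable logits, spreading softmax attention over more keys pushes the entropy up and shrinks the weight any single token can receive.

```python
# Toy illustration of attention dilution: with similar logits, more attended
# tokens means higher softmax entropy and a smaller maximum attention weight.
import torch

torch.manual_seed(0)

for n_tokens in [2_000, 8_000, 32_000]:
    logits = torch.randn(n_tokens)            # hypothetical logits for one query
    probs = torch.softmax(logits, dim=-1)     # attention weights over all keys
    entropy = -(probs * probs.log()).sum()    # attention entropy in nats
    print(f"{n_tokens:>6} tokens -> entropy {entropy.item():5.2f} nats, "
          f"max weight {probs.max().item():.2e}")
```

The entropy grows roughly with the logarithm of the number of attended tokens, which is the dilution effect the authors describe.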
Introducing LM-Infinite
To tackle these issues, the authors propose two simple modifications:
1. Λ-Shaped Attention Mask: Attention is restricted to a window of recent tokens, preserving local context, while the initial salient tokens always remain visible. This retains some of the implicitly encoded position information and prevents attention dilution.
2. Bounding Relative Distances: This clips effective distances during attention to the maximum training length, capping exploding logits from unseen long distances.
These principles make LM-Infinite model-agnostic: it can be applied without retraining to any LLM that uses relative position encodings such as RoPE or ALiBi. A minimal sketch of both ideas is shown below.
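The sketch below is an illustration of the two ideas, not the authors' implementation. The parameter names n_global, n_local, and l_train are assumptions introduced here, standing for how many leading tokens stay visible, the size of the local window, and the pretraining context length.

```python
# Sketch of the two LM-Infinite ideas: a Λ-shaped causal attention mask and
# relative distances clipped to the training length. Illustrative only.
import torch

def lambda_mask_and_distances(seq_len: int,
                              n_global: int = 10,
                              n_local: int = 2048,
                              l_train: int = 2048):
    """Return a Λ-shaped causal attention mask and clipped relative distances."""
    q = torch.arange(seq_len).unsqueeze(1)    # query positions (rows)
    k = torch.arange(seq_len).unsqueeze(0)    # key positions (columns)
    dist = q - k                              # how far each key lies behind the query

    causal = dist >= 0                        # never attend to future tokens
    local = dist < n_local                    # recent tokens within the local window
    leading = k < n_global                    # always keep the first few tokens visible
    mask = causal & (local | leading)         # Λ-shaped mask

    # Bound the effective distance fed to the position encoding so the model
    # never sees a distance larger than those encountered during training.
    clipped_dist = dist.clamp(max=l_train - 1)
    return mask, clipped_dist

mask, dist = lambda_mask_and_distances(seq_len=8, n_global=2, n_local=3, l_train=4)
print(mask.int())   # 1 = attended, 0 = masked
print(dist)         # relative distances, capped at l_train - 1
```

In a real model, the boolean mask would be applied to the attention logits (for example, by adding -inf at masked positions) and the clipped distances would be passed to the relative position encoding in place of the raw ones.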
Impressive Results
LM-Infinite was tested on a variety of LLMs including LLaMA, GPT-J, and MPT-7B on ArXiv and OpenWebText datasets. The results were impressive:
- Perplexity remained stable at lengths up to 32k tokens, 3-16x longer than the models' training context lengths, indicating maintained fluency.
- BLEU and ROUGE scores also stayed consistent, with quality comparable to or better than fine-tuned models.
- On passkey retrieval with distractors, accuracy degraded more slowly than for baseline models, extending coherent generation.
- Encoding and decoding were sped up by 3.16x and 2.72x respectively on length 32k with no drop in quality.
In Conclusion
This paper has identified the key limitations of relative position encodings and introduced LM-Infinite to address these issues through attention masking and distance bounding. This simple technique demonstrated consistent fluency and performance on a variety of LLMs at lengths far exceeding the training corpus. LM-Infinite provides an effective and model-agnostic solution for length generalization, without requiring fine-tuning. This is a significant step forward in the field of natural language generation, and we look forward to seeing how future work can further improve information retention from masked content.