MHLA: Restoring Expressivity of Linear Attention via
Token-Level Multi-Head

¹Peking University  ²NVIDIA
*Indicates Equal Contribution

Video demonstration of MHLA.

MHLA preview

Performance and efficiency of MHLA at a glance.

A Universal High-Efficiency Linear Attention Operator

Task                    Performance gain    Complexity
Image Classification    +3.6%               Linear
Image Generation        +12.6%              Linear
Language Modeling       +6.3%               Linear
Video Generation        +41%                Linear

Why MHLA?

Key Features that Set MHLA Apart

Optimal Efficiency

For sequence lengths above 1k, MHLA surpasses Flash Attention in speed, while matching the efficiency of vanilla linear attention with zero added overhead.

Flexible Attention Forms

Natively supports both causal and non-causal attention, with built-in compatibility for chunkwise training.

Token-Level Diversity

Introduces diversity at the token level to break global context collapse in linear attention, unlocking significant performance gains.

About MHLA

MHLA pipeline overview

While the Transformer architecture dominates many fields, its quadratic self-attention complexity hinders its use in large-scale applications. Linear attention offers an efficient alternative, but applying it directly often degrades performance, and existing fixes typically re-introduce computational overhead through extra modules (e.g., depthwise separable convolutions and a few self-attention blocks) that defeat the original purpose. In this work, we identify a key failure mode underlying this degradation: global context collapse, where the model loses representational diversity. To address it, we propose Multi-Head Linear Attention (MHLA), which preserves this diversity by computing attention within heads divided along the token dimension. We prove that MHLA maintains linear complexity while recovering much of the expressive power of softmax attention, and we verify its effectiveness across multiple domains, achieving a 3.6% improvement on ImageNet classification, a 6.3% gain on language modeling, a 12.6% improvement in image generation, and a 41% improvement in video generation, all at the same computational complexity.
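To make the idea concrete, here is a minimal, unofficial PyTorch sketch of token-level multi-head linear attention: the sequence is split into groups of tokens, and non-causal linear attention is computed independently inside each group. The function name, the ReLU feature map, and the contiguous grouping are illustrative assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F


def token_multihead_linear_attention(q, k, v, num_token_heads):
    # q, k, v: (batch, seq_len, dim); seq_len must be divisible by num_token_heads.
    b, n, d = q.shape
    g = num_token_heads
    assert n % g == 0, "sequence length must split evenly into token-level heads"

    # Non-negative feature map (an illustrative choice; other kernels are possible).
    q, k = F.relu(q), F.relu(k)

    # Reshape so each "head" is a contiguous group of tokens, not a slice of channels.
    q = q.reshape(b, g, n // g, d)
    k = k.reshape(b, g, n // g, d)
    v = v.reshape(b, g, n // g, d)

    # Non-causal linear attention within each token group; overall cost stays linear in n.
    kv = torch.einsum("bgnd,bgne->bgde", k, v)   # per-group key-value summary
    z = k.sum(dim=2)                             # per-group normalizer
    out = torch.einsum("bgnd,bgde->bgne", q, kv)
    out = out / (torch.einsum("bgnd,bgd->bgn", q, z).unsqueeze(-1) + 1e-6)
    return out.reshape(b, n, d)


# Example: 4 token-level heads over a 1,024-token sequence.
x = torch.randn(2, 1024, 64)
y = token_multihead_linear_attention(x, x, x, num_token_heads=4)
print(y.shape)  # torch.Size([2, 1024, 64])

Because each group aggregates only its own tokens, the per-group key-value summaries differ across the sequence, which is one way to preserve the token-level diversity described above; the paper's causal and chunkwise training variants are not shown in this sketch.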

Video Generation

MHLA achieves the same VBench score as Flash Attention

VBench Score Comparison

2.2× faster than Flash Attention

Latency Comparison (Lower is Better)

Videos generated by our 1.3B model derived from Wan2.1.

BibTeX

@misc{mhla,
      title={MHLA: Restoring Expressivity of Linear Attention via Token-Level Multi-Head}, 
      author={Kewei Zhang and Ye Huang and Yufan Deng and Jincheng Yu and Junsong Chen and Huan Ling and Enze Xie and Daquan Zhou},
      year={2026},
      eprint={2601.07832},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.07832}, 
}