MHLA: Restoring Expressivity of Linear Attention via
Token-Level Multi-Head

¹Peking University  ²NVIDIA
*Indicates Equal Contribution

Video demonstration of MHLA.

MHLA preview

Performance and efficiency of MHLA at a glance.

A Universal High-Efficiency Linear Attention Operator

Task                    Performance gain    Complexity
Image Classification    +3.6%               Linear
Image Generation        +12.6%              Linear
Language Modeling       +6.3%               Linear
Video Generation        +41%                Linear

Why MHLA?

Key Features that Set MHLA Apart

Optimal Efficiency

For sequence lengths above 1k, MHLA surpasses Flash Attention in speed, while matching the efficiency of vanilla linear attention with zero added overhead.

Flexible Attention Forms

Natively supports both causal and non-causal attention, with built-in compatibility for chunkwise training.

Token-Level Diversity

Introduces diversity at the token level to break global context collapse in linear attention, unlocking significant performance gains.

About MHLA

MHLA pipeline overview

While the Transformer architecture dominates many fields, its quadratic self-attention complexity hinders its use in large-scale applications. Linear attention offers an efficient alternative, but applying it directly often degrades performance, and existing fixes typically re-introduce computational overhead through extra modules (e.g., depthwise separable convolutions and a few self-attention blocks) that defeat the original purpose. In this work, we identify a key failure mode underlying this degradation: global context collapse, where the model loses representational diversity. To address it, we propose Multi-Head Linear Attention (MHLA), which preserves this diversity by computing attention within heads divided along the token dimension. We prove that MHLA maintains linear complexity while recovering much of the expressive power of softmax attention, and we verify its effectiveness across multiple domains, achieving a 3.6% improvement on ImageNet classification, a 6.3% gain on language modeling, a 12.6% improvement in image generation, and a 41% improvement in video generation, all at the same computational complexity.
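To make the idea concrete, here is a minimal, unofficial PyTorch sketch of token-level multi-head linear attention: the sequence is split into groups of tokens, and non-causal linear attention is computed independently inside each group. The function name, the ReLU feature map, and the contiguous grouping are illustrative assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F


def token_multihead_linear_attention(q, k, v, num_token_heads):
    # q, k, v: (batch, seq_len, dim); seq_len must be divisible by num_token_heads.
    b, n, d = q.shape
    g = num_token_heads
    assert n % g == 0, "sequence length must split evenly into token-level heads"

    # Non-negative feature map (an illustrative choice; other kernels are possible).
    q, k = F.relu(q), F.relu(k)

    # Reshape so each "head" is a contiguous group of tokens, not a slice of channels.
    q = q.reshape(b, g, n // g, d)
    k = k.reshape(b, g, n // g, d)
    v = v.reshape(b, g, n // g, d)

    # Non-causal linear attention within each token group; overall cost stays linear in n.
    kv = torch.einsum("bgnd,bgne->bgde", k, v)   # per-group key-value summary
    z = k.sum(dim=2)                             # per-group normalizer
    out = torch.einsum("bgnd,bgde->bgne", q, kv)
    out = out / (torch.einsum("bgnd,bgd->bgn", q, z).unsqueeze(-1) + 1e-6)
    return out.reshape(b, n, d)


# Example: 4 token-level heads over a 1,024-token sequence.
x = torch.randn(2, 1024, 64)
y = token_multihead_linear_attention(x, x, x, num_token_heads=4)
print(y.shape)  # torch.Size([2, 1024, 64])

Because each group aggregates only its own tokens, the per-group key-value summaries differ across the sequence, which is one way to preserve the token-level diversity described above; the paper's causal and chunkwise training variants are not shown in this sketch.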

Video Generation

MHLA achieves the same VBench score as Flash Attention

VBench Score Comparison

2.2× faster than Flash Attention

Latency Comparison (Lower is Better)

Videos generated by our 1.3B model derived from Wan2.1.

BibTeX

@misc{mhla,
      title={MHLA: Restoring Expressivity of Linear Attention via Token-Level Multi-Head}, 
      author={Kewei Zhang and Ye Huang and Yufan Deng and Jincheng Yu and Junsong Chen and Huan Ling and Enze Xie and Daquan Zhou},
      year={2026},
      eprint={2601.07832},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.07832}, 
}