MHLA: Restoring Expressivity of Linear Attention via Token-Level Multi-Head
Images and videos generated by our 600M and 1.3B models equipped with MHLA.
Video demonstration of MHLA.
Performance and efficiency of MHLA at a glance.
A Universal High-Efficiency Linear Attention Operator
Image Classification
Image Generation
Language Modeling
Video Generation
Why MHLA?
Key Features that Set MHLA Apart
Optimal Efficiency
For sequence lengths above 1k, MHLA surpasses Flash Attention in speed, while matching the efficiency of vanilla linear attention with zero overhead.
Flexible Attention Forms
Natively supports both causal and non-causal attention modes, with built-in chunkwise training compatibility for enhanced flexibility (a generic chunkwise sketch follows this list).
Token-Level Diversity
Introduces diversity at the token level to break global context collapse in linear attention, unlocking significant performance gains.
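To make the chunkwise training claim concrete, here is a minimal PyTorch sketch of generic chunkwise causal linear attention: tokens attend to all previous chunks through a running K^T V state, and to earlier tokens in their own chunk through a small masked product, so the overall cost stays linear in sequence length. The function name, the ELU+1 feature map, and the chunk size are illustrative assumptions; this is not the released MHLA kernel.

import torch
import torch.nn.functional as F


def chunkwise_causal_linear_attn(q, k, v, chunk_size=64, eps=1e-6):
    """Generic chunkwise causal linear attention (illustrative, not the MHLA kernel).

    q, k, v: (batch, seq_len, dim); seq_len is assumed divisible by chunk_size.
    """
    b, n, d = q.shape
    q, k = F.elu(q) + 1.0, F.elu(k) + 1.0          # positive feature map (assumption)

    kv_state = torch.zeros(b, d, d, dtype=q.dtype, device=q.device)  # running K^T V
    k_state = torch.zeros(b, d, dtype=q.dtype, device=q.device)      # running sum of K
    outputs = []

    for start in range(0, n, chunk_size):
        qc = q[:, start:start + chunk_size]         # (b, c, d)
        kc = k[:, start:start + chunk_size]
        vc = v[:, start:start + chunk_size]

        # Inter-chunk part: contribution of all previous chunks via the running state.
        inter = torch.einsum("bcd,bde->bce", qc, kv_state)
        norm_inter = torch.einsum("bcd,bd->bc", qc, k_state)

        # Intra-chunk part: causal attention inside the chunk (quadratic only in chunk_size).
        scores = torch.einsum("bcd,bed->bce", qc, kc)
        mask = torch.tril(torch.ones(qc.shape[1], qc.shape[1], device=q.device))
        scores = scores * mask
        intra = torch.einsum("bce,bed->bcd", scores, vc)
        norm_intra = scores.sum(dim=-1)

        out = (inter + intra) / (norm_inter + norm_intra + eps).unsqueeze(-1)
        outputs.append(out)

        # Fold this chunk into the running states before moving on.
        kv_state = kv_state + torch.einsum("bcd,bce->bde", kc, vc)
        k_state = k_state + kc.sum(dim=1)

    return torch.cat(outputs, dim=1)

Dropping the causal mask and the running state reduces this to a single non-causal pass, which is why the same formulation covers both attention modes.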
About MHLA
While the Transformer architecture dominates many fields, its quadratic self-attention complexity hinders its use in large-scale applications. Linear attention offers an efficient alternative, but its direct application often degrades performance, and existing fixes typically re-introduce computational overhead through extra modules (e.g., depthwise separable convolutions or a few self-attention blocks) that defeat the original purpose. In this work, we identify a key failure mode in these methods: global context collapse, where the model loses representational diversity. To address this, we propose Multi-Head Linear Attention (MHLA), which preserves this diversity by computing attention within heads divided along the token dimension. We prove that MHLA maintains linear complexity while recovering much of the expressive power of softmax attention, and verify its effectiveness across multiple domains, achieving a 3.6% improvement on ImageNet classification, a 6.3% gain on NLP, a 12.6% improvement in image generation, and a 41% enhancement in video generation at the same computational complexity.
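As a rough illustration of "heads divided along the token dimension", the PyTorch sketch below splits the sequence into contiguous token blocks and runs vanilla non-causal linear attention independently within each block, so every block keeps its own K^T V summary instead of collapsing into a single global state. The function name, the ELU+1 feature map, and the contiguous-block head split are assumptions for illustration; the actual MHLA operator may form and combine heads differently.

import torch
import torch.nn.functional as F


def token_level_multihead_linear_attn(q, k, v, num_heads=4, eps=1e-6):
    """Non-causal linear attention with heads split along the token dimension
    (a sketch of the idea described in the abstract, not the official operator).

    q, k, v: (batch, seq_len, dim); seq_len is assumed divisible by num_heads.
    """
    b, n, d = q.shape
    assert n % num_heads == 0, "sequence length must be divisible by num_heads"

    # Positive feature map, as in standard linear attention (assumption: ELU+1).
    q, k = F.elu(q) + 1.0, F.elu(k) + 1.0

    # Split the *token* dimension into heads: (b, h, n/h, d).
    q = q.reshape(b, num_heads, n // num_heads, d)
    k = k.reshape(b, num_heads, n // num_heads, d)
    v = v.reshape(b, num_heads, n // num_heads, d)

    # Per-head linear attention: each head builds its own d x d state,
    # so the total cost stays O(n * d^2), same as vanilla linear attention.
    kv = torch.einsum("bhnd,bhne->bhde", k, v)                   # per-head K^T V
    z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps)
    out = torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)

    return out.reshape(b, n, d)


if __name__ == "__main__":
    x = torch.randn(2, 1024, 64)
    y = token_level_multihead_linear_attn(x, x, x, num_heads=8)
    print(y.shape)  # torch.Size([2, 1024, 64])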
Image Generation
8.2× speedup compared with Flash Attention
Speed Comparison (Throughput)
Images generated with SANA-MHLA (600M)
Video Generation
MHLA achieves the same VBench score as Flash Attention
VBench Score Comparison
2.2× faster than Flash Attention
Latency Comparison (Lower is Better)
Videos generated by our 1.3B model derived from Wan2.1.
Image Classification
MHLA outperforms self-attention across multiple resolutions
Accuracy Comparison on DeiT-T (ACC %)
224×224
384×384
512×512
Language Modeling
MHLA leads on both the CSR and LongBench benchmarks
CSR Average
LongBench Average
BibTeX
@misc{mhla,
title={MHLA: Restoring Expressivity of Linear Attention via Token-Level Multi-Head},
author={Kewei Zhang and Ye Huang and Yufan Deng and Jincheng Yu and Junsong Chen and Huan Ling and Enze Xie and Daquan Zhou},
year={2026},
eprint={2601.07832},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2601.07832},
}