AR-Omni: A Unified Autoregressive Model for Any-to-Any Generation

One decoder, one token stream: unified AR generation for text, images, and speech.
Dongjie Cheng1*, Ruifeng Yuan1*, Yongqi Li1†, Runyang You1
Wenjie Wang2, Liqiang Nie3, Lei Zhang1, Wenjie Li1
1The Hong Kong Polytechnic University
2University of Science and Technology of China 3Harbin Institute of Technology (Shenzhen)
*Equal contribution  †Corresponding author

✨ TL;DR

AR-Omni is a single-decoder, single-token-stream autoregressive any-to-any model. It casts multimodal generation as standard next-token prediction over interleaved sequences, and improves training and inference with task-aware loss reweighting, a token-level perceptual alignment loss for image tokens, and a finite-state decoding mechanism.
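
To make the single-token-stream idea concrete, here is a minimal, self-contained sketch (an assumption for illustration, not the released implementation): a toy causal decoder is trained with one ordinary next-token cross-entropy over an interleaved sequence of text tokens, image codes, and speech codes separated by boundary tokens. All vocabulary sizes, special-token names (e.g. <boi>, <eoi>), and model dimensions below are placeholders.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical shared vocabulary: text tokens, image codes, speech codes, and a few
# modality-boundary specials all live in one ID space (sizes and names are placeholders).
TEXT, IMG, SPEECH = 32000, 8192, 4096
SPECIALS = {"<boi>": 0, "<eoi>": 1, "<bos_sp>": 2, "<eos_sp>": 3}
VOCAB = len(SPECIALS) + TEXT + IMG + SPEECH

class TinyUnifiedDecoder(nn.Module):
    """A toy single-decoder causal LM over the shared multimodal vocabulary
    (positional encodings omitted for brevity)."""
    def __init__(self, d=256, layers=2, heads=4):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, d)
        block = nn.TransformerEncoderLayer(d, heads, 4 * d, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(d, VOCAB)

    def forward(self, ids):
        causal = nn.Transformer.generate_square_subsequent_mask(ids.size(1)).to(ids.device)
        return self.head(self.blocks(self.emb(ids), mask=causal))

# One interleaved stream: a text prompt, then an image segment, then a speech segment.
text   = torch.randint(len(SPECIALS),              len(SPECIALS) + TEXT,       (1, 16))
image  = torch.randint(len(SPECIALS) + TEXT,       len(SPECIALS) + TEXT + IMG, (1, 32))
speech = torch.randint(len(SPECIALS) + TEXT + IMG, VOCAB,                      (1, 24))
sp = {k: torch.tensor([[v]]) for k, v in SPECIALS.items()}
stream = torch.cat([text, sp["<boi>"], image, sp["<eoi>"],
                    sp["<bos_sp>"], speech, sp["<eos_sp>"]], dim=1)

model = TinyUnifiedDecoder()
logits = model(stream[:, :-1])  # predict token t+1 from tokens <= t
loss = F.cross_entropy(logits.reshape(-1, VOCAB), stream[:, 1:].reshape(-1))
loss.backward()
print(f"single next-token loss over the interleaved stream: {loss.item():.3f}")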

📘 Abstract

Real-world perception and interaction are inherently multimodal, encompassing not only language but also vision and speech, which motivates the development of “Omni” MLLMs that support both multimodal inputs and multimodal outputs. While a series of omni MLLMs has emerged, most existing systems still rely on additional expert components for multimodal generation, which complicates unified training and inference. Autoregressive (AR) modeling, with a single token stream, a single next-token objective, and a single decoder, has proven an elegant and scalable foundation in the text domain. Motivated by this, we present AR-Omni, a unified any-to-any model in the autoregressive paradigm without any expert decoders. AR-Omni supports autoregressive text and image generation, as well as streaming speech generation, all under a single Transformer decoder. We further address three practical issues in unified AR modeling: modality imbalance via task-aware loss reweighting, visual fidelity via a lightweight token-level perceptual alignment loss for image tokens, and the stability-creativity trade-off via a finite-state decoding mechanism. Empirically, AR-Omni achieves strong quality across all three modalities while remaining real-time, with a real-time factor of 0.88 for speech generation (i.e., audio is generated faster than it plays back).
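
The three fixes above can be pictured as small additions to the same next-token objective. The hedged sketch below scales per-token cross-entropy by a per-modality weight (task-aware loss reweighting) and adds an assumed token-level perceptual term on image positions that pulls the expected codebook embedding under the predicted distribution toward the target code's embedding. The weights, the modality labeling, the 0.1 coefficient, and the MSE form of the perceptual term are illustrative assumptions, not the paper's exact recipe.

import torch
import torch.nn.functional as F

def reweighted_ar_loss(logits, targets, modality_ids, code_emb=None, weights=None):
    """Task-aware reweighting of next-token cross-entropy (illustrative only).

    logits:       (B, T, V) decoder outputs
    targets:      (B, T)    next-token targets in the shared vocabulary
    modality_ids: (B, T)    0=text, 1=image, 2=speech for each target (assumed labeling)
    code_emb:     (V, D)    optional embedding table used for the assumed
                            token-level perceptual alignment term on image tokens
    weights:      per-modality loss weights (placeholder values below)
    """
    weights = weights or {0: 1.0, 1: 0.5, 2: 0.8}
    ce = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")  # (B, T)
    w = torch.ones_like(ce)
    for m, wm in weights.items():
        w = torch.where(modality_ids == m, torch.full_like(ce, wm), w)
    loss = (w * ce).mean()

    if code_emb is not None:
        # Assumed perceptual term: the expected embedding under the predicted
        # distribution should stay close to the target code's embedding,
        # applied only at image-token positions.
        img = modality_ids == 1
        if img.any():
            probs = logits.softmax(-1)          # (B, T, V)
            pred_emb = probs @ code_emb         # (B, T, D)
            tgt_emb = code_emb[targets]         # (B, T, D)
            loss = loss + 0.1 * F.mse_loss(pred_emb[img], tgt_emb[img])
    return loss

Down-weighting the much longer image and speech segments, for instance, keeps any single modality from dominating the gradient of the shared decoder; the actual weights used by AR-Omni are not reproduced here.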

🧠 Architecture

Figure: Overview of AR-Omni.
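
The diagram above covers the model itself; the finite-state decoding mechanism mentioned in the abstract is easiest to picture as a small state machine wrapped around sampling. The sketch below is an assumed illustration rather than the released decoding rules: the current state restricts which token IDs can be sampled and sets a per-state temperature, and boundary tokens trigger state transitions, one simple way to trade stability against creativity per modality.

import torch

# Assumed shared-vocabulary layout and boundary tokens (placeholders matching the
# toy sketches above, not the released tokenizer).
SPECIALS = {"<boi>": 0, "<eoi>": 1, "<bos_sp>": 2, "<eos_sp>": 3}
TEXT_RANGE, IMAGE_RANGE, SPEECH_RANGE = (4, 32004), (32004, 40196), (40196, 44292)

# State -> (allowed target-ID range, sampling temperature). Values are illustrative,
# e.g. a lower temperature for speech (stability), a higher one for images (diversity).
STATES = {
    "text":   (TEXT_RANGE,   0.9),
    "image":  (IMAGE_RANGE,  1.0),
    "speech": (SPEECH_RANGE, 0.7),
}
TRANSITIONS = {  # boundary token emitted -> next state
    SPECIALS["<boi>"]: "image",     SPECIALS["<eoi>"]: "text",
    SPECIALS["<bos_sp>"]: "speech", SPECIALS["<eos_sp>"]: "text",
}

def fsm_step(logits, state):
    """Sample one token from last-step logits (shape [vocab]) under the current
    state's constraints, then return (token, next_state)."""
    (lo, hi), temp = STATES[state]
    mask = torch.full_like(logits, float("-inf"))
    mask[lo:hi] = 0.0
    for tok in TRANSITIONS:          # boundary tokens stay reachable so the stream
        mask[tok] = 0.0              # can legally switch modality
    probs = torch.softmax((logits + mask) / temp, dim=-1)
    token = torch.multinomial(probs, 1).item()
    return token, TRANSITIONS.get(token, state)

# Toy usage with random logits over the assumed 44,292-ID vocabulary.
tok, state = fsm_step(torch.randn(44292), "text")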

🧩 Demonstrations

• Example 1: Speech to Image
  The user gives a spoken prompt; AR-Omni-Chat replies with a generated image ("Speech to Image result").

• Example 2: Image to Speech
  The user provides an input image; AR-Omni-Chat replies with speech.

• Example 3: Multi-Turn Interleaved Conversation
  A three-turn exchange mixing modalities; AR-Omni-Chat's second reply includes a generated forest image.

• Example 4: Multi-Turn Speech Conversation
  A four-turn spoken exchange between the user and AR-Omni-Chat.

📚 BibTeX

Please consider citing our work if you find it useful:
@misc{cheng2026aromniunifiedautoregressivemodel,
      title={AR-Omni: A Unified Autoregressive Model for Any-to-Any Generation}, 
      author={Dongjie Cheng and Ruifeng Yuan and Yongqi Li and Runyang You and Wenjie Wang and Liqiang Nie and Lei Zhang and Wenjie Li},
      year={2026},
      eprint={2601.17761},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2601.17761}, 
}

🙏 Acknowledgements

We thank the open-source projects and research community that made this work possible.

This project is licensed under the MIT License. It also complies with the licenses of referenced third-party projects and dependencies, including the Chameleon Research License.

🪐 AR-Omni · ModalityDance
Maintained by Dongjie Cheng. Updated on 2026-01-20.