Your ViT is Secretly an Image Segmentation Model

CVPR 2025 · Highlight Paper

Tommie Kerssies¹, Niccolò Cavagnero²,*, Alexander Hermans³, Narges Norouzi¹, Giuseppe Averta², Bastian Leibe³, Gijs Dubbelman¹, Daan de Geus¹,³

¹ Eindhoven University of Technology, ² Polytechnic of Turin, ³ RWTH Aachen University

* Work done while visiting RWTH.



Overview

We present the Encoder-only Mask Transformer (EoMT), a minimalist image segmentation model that repurposes a plain Vision Transformer (ViT) to jointly encode image patches and segmentation queries as tokens. No adapters. No decoders. Just the ViT.

By leveraging large-scale pre-trained ViTs, EoMT achieves accuracy similar to state-of-the-art methods that rely on complex, task-specific components. Because it is simpler, it is also significantly faster than those methods, for example up to 4× faster with ViT-L.

Turns out, your ViT is secretly an image segmentation model. EoMT shows that architectural complexity isn’t necessary. For segmentation, a plain Transformer is all you need.
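The joint encoding of image patches and segmentation queries described above can be illustrated with a small toy sketch. It is not the authors' implementation. The attention block uses identity projections as a stand-in for a real pre-trained ViT block, and the mask head (a dot product between each query token and each patch token) is a common mask-transformer design assumed here for illustration, not taken from this page. All function names, sizes, and the number of blocks are hypothetical.

```python
import math
import random


def self_attention(tokens):
    """Toy single-head self-attention with identity projections and a residual.

    Stand-in for one ViT block (assumption: a real block also has learned
    projections, an MLP, and normalization).
    """
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        mixed = [sum(w / total * v[d] for w, v in zip(weights, tokens)) for d in range(dim)]
        out.append([x + y for x, y in zip(q, mixed)])  # residual connection
    return out


def eomt_like_forward(patch_tokens, query_tokens, num_blocks=2):
    """Encode patches and queries jointly, then read out one mask per query."""
    # Queries are simply appended to the patch tokens: one shared sequence.
    tokens = patch_tokens + query_tokens
    for _ in range(num_blocks):
        tokens = self_attention(tokens)
    num_patches = len(patch_tokens)
    patches, queries = tokens[:num_patches], tokens[num_patches:]
    # Assumed mask head: similarity of each query to each patch.
    return [[sum(a * b for a, b in zip(q, p)) for p in patches] for q in queries]


random.seed(0)
dim, num_patches, num_queries = 8, 16, 3
patch_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
query_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_queries)]

mask_logits = eomt_like_forward(patch_tokens, query_tokens)
print(len(mask_logits), len(mask_logits[0]))  # 3 16
```

Each of the 3 queries yields one row of 16 per-patch logits, which would be reshaped to the patch grid (4×4 here) to form a coarse mask. Because the queries pass through the same blocks as the patches, the sketch needs no separate decoder.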


Encoder-only Mask Transformer (EoMT)

Citation