Your ViT is Secretly an Image Segmentation Model

CVPR 2025 · Highlight Paper

1 Eindhoven University of Technology, 2 Polytechnic of Turin, 3 RWTH Aachen University

* Work done while visiting RWTH.

[Figure: Architecture diagram] [Figure: Performance plot]

[1] Bowen Cheng et al., Masked-attention Mask Transformer for Universal Image Segmentation, CVPR 2022.
[2] Zhe Chen et al., Vision Transformer Adapter for Dense Predictions, ICLR 2023.

Overview

We present the Encoder-only Mask Transformer (EoMT), a minimalist image segmentation model that repurposes a plain Vision Transformer (ViT) to jointly encode image patches and segmentation queries as tokens. No adapters. No decoders. Just the ViT.

By leveraging large-scale pre-trained ViTs, EoMT achieves accuracy on par with state-of-the-art methods that rely on complex, task-specific components. At the same time, its simplicity makes it significantly faster: up to 4× faster with ViT-L, for example.

Turns out, your ViT is secretly an image segmentation model. EoMT shows that architectural complexity isn't necessary: plain Transformer power is all you need.
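The core idea can be sketched in a few lines. The snippet below is an illustrative toy, not the authors' implementation: learnable query tokens are concatenated with the image's patch tokens, all tokens are processed jointly by the same plain Transformer blocks, and each query then predicts a mask via dot products with the patch features. All sizes, weights, and the single-head attention block are stand-in assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 32        # token dimension
H = W = 4     # patch grid, i.e. 16 patch tokens
Q = 3         # number of segmentation queries

def self_attention_block(x, Wq, Wk, Wv):
    """Simplified single-head self-attention with a residual connection,
    a stand-in for a full ViT block (no norm/MLP for brevity)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(D)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return x + attn @ v

# Patch tokens (from the image) and learnable query tokens, concatenated:
patches = rng.standard_normal((H * W, D))
queries = rng.standard_normal((Q, D))
tokens = np.concatenate([patches, queries], axis=0)  # shape: (H*W + Q, D)

# All tokens are encoded jointly by the same plain Transformer blocks;
# no adapter, no separate decoder.
for _ in range(2):
    Wq, Wk, Wv = (rng.standard_normal((D, D)) * D**-0.5 for _ in range(3))
    tokens = self_attention_block(tokens, Wq, Wk, Wv)

patch_feats, query_feats = tokens[:H * W], tokens[H * W:]

# Each query yields one mask: dot product with every patch feature,
# reshaped back to the spatial grid.
mask_logits = (query_feats @ patch_feats.T).reshape(Q, H, W)
print(mask_logits.shape)  # (3, 4, 4)
```

Because the queries attend to patches (and vice versa) inside the ViT itself, no extra cross-attention decoder is needed; the mask prediction at the end is just a dot product.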

Encoder-only Mask Transformer (EoMT)

[Figure: EoMT method overview]

Citation

@inproceedings{kerssies2025eomt,
  author     = {Kerssies, Tommie and Cavagnero, Niccolò and Hermans, Alexander and Norouzi, Narges and Averta, Giuseppe and Leibe, Bastian and Dubbelman, Gijs and de Geus, Daan},
  title      = {Your ViT is Secretly an Image Segmentation Model},
  booktitle  = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year       = {2025},
}