This repo contains an official PyTorch implementation of VIRTUE: Visual-Interactive Text-Image Universal Embedder by Wei-Yao Wang, Kazuya Tateishi, Qiyu Wu, Shusuke Takahashi, Yuki Mitsufuji.
Multimodal representation learning models have demonstrated success across complex tasks, and the integration of vision-language models (VLMs) has further enabled embedding models with instruction-following capabilities. However, existing embedding models lack visual-interactive capabilities to specify regions of interest from users (e.g., point, bounding box, mask), which have been explored in generative models to broaden their human-interactive applicability. Equipping embedding models with visual interactions would not only unlock new applications with localized grounding of user intent, which remains unexplored, but also enable the models to learn entity-level information within images to complement their global representations for conventional embedding tasks. In this paper, we propose a novel Visual-InteRactive Text-image Universal Embedder (VIRTUE) that extends the capabilities of the segmentation model and the vision-language model to the realm of representation learning. In VIRTUE, the segmentation model can process visual prompts that pinpoint specific regions within an image, thereby enabling the embedder to handle complex and ambiguous scenarios more precisely. To evaluate the visual-interaction ability of VIRTUE, we introduce a large-scale Segmentation-and-Scene Caption Retrieval (SCaR) benchmark comprising 1M samples that aims to retrieve the text caption by jointly considering the entity with a specific object and image scene. VIRTUE consistently achieves state-of-the-art performance with significant improvements across 36 universal MMEB (3.1%–8.5%) and five visual-interactive SCaR (15.2%–20.3%) tasks.
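For intuition, embedding-based retrieval as described above reduces to ranking caption embeddings by similarity to the (image, visual prompt) embedding. The sketch below is purely illustrative, uses none of the actual VIRTUE API, and substitutes toy 2-D vectors for real embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_captions(query_emb, caption_embs):
    """Indices of captions sorted by descending similarity to the query."""
    scores = [cosine(query_emb, c) for c in caption_embs]
    return sorted(range(len(scores)), key=lambda i: -scores[i])

# Toy 2-D embeddings: the query is closest in direction to caption 1.
query = [1.0, 0.0]
captions = [[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
print(rank_captions(query, captions))  # → [1, 0, 2]
```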
Please refer to the example code:
python3 virtue-example.py
.
├── virtue (this repo)
│ ├── assets
│ │ ├── example.jpg
│ │ └── teaser.png
│ ├── codes
│ │ ├── configs
│ │ ├── demo.py
│ │ ├── eval.py
│ │ ├── HF_model_conversion
│ │ ├── requirements.txt
│ │ ├── sam2_checkpoints
│ │ ├── scripts
│ │ ├── src
│ │ ├── train.py
│ │ └── virtue-example.py
│ ├── docker
│ │ └── virtue-env.Dockerfile
│ ├── docs
│ │ ├── index.html
│ │ └── static
│ └── README.md
├── data
│ ├── MMEB-train
│ ├── SCaR-eval
│ │ ├── images
│ │ │ ├── ade20k_val
│ │ │ ├── coco_stuff_val
│ │ │ ├── refcocog_val
│ │ │ ├── refcoco_plus_val
│ │ │ └── visualgenome_val
│ │ ├── SCaR_eval_ADE20K.parquet
│ │ ├── SCaR_eval_COCO_Stuff.parquet
│ │ ├── SCaR_eval_RefCOCOg.parquet
│ │ ├── SCaR_eval_RefCOCO_plus.parquet
│ │ └── SCaR_eval_VisualGenome.parquet
│ └── SCaR-train
│ ├── images
│ │ ├── ade20k_train
│ │ ├── coco_stuff_train
│ │ ├── refcocog_train
│ │ ├── refcoco_plus_train
│ │ └── visualgenome_train
│ ├── SCAR_ADE20K.parquet
│ ├── SCAR_COCO_Stuff.parquet
│ ├── SCAR_RefCOCOg.parquet
│ ├── SCAR_RefCOCO_plus.parquet
│ └── SCAR_VisualGenome.parquet
The environment uses Python 3.11.
conda create -n virtue python=3.11 -y
conda activate virtue
conda install pytorch=2.5 torchvision torchaudio pytorch-cuda=12.1 "mkl>=2023,<2025" -c pytorch -c nvidia -y
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.0.post2/flash_attn-2.7.0.post2+cu12torch2.5cxx11abiFALSE-cp311-cp311-linux_x86_64.whl && pip install flash_attn-2.7.0.post2+cu12torch2.5cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
pip install -r requirements.txt  # run inside the codes/ folder
pip install -e .  # clone SAM2 and run this inside the SAM2 repo
We also provide the dockerfile under docker/.
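A typical build-and-run sequence with that Dockerfile might look like the following (the image tag `virtue-env` and the mount points are our choice, not prescribed by the repo):

```shell
# Build the environment image from the provided Dockerfile.
docker build -f docker/virtue-env.Dockerfile -t virtue-env .

# Start an interactive container with GPU access, mounting the repo.
docker run --gpus all -it --rm -v "$(pwd)":/workspace -w /workspace virtue-env bash
```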
Please download MMEB-train and MMEB-eval from the official datasets, and set up the corresponding paths in train_mmeb.yaml and eval_mmeb.yaml.
Please specify the paths of the parquet files via data_basedir in virtue_eval.yaml, and download the raw images from RefCOCO+, RefCOCOg, COCO-Stuff, VisualGenome, and ADE20K.
Prepare the SAM2 checkpoints in advance and specify the corresponding paths in:
sam_config:
  config_path: "./sam2.1/sam2.1_hiera_b+.yaml"
  checkpoint: "/your/path/to/sam2_checkpoints/sam2.1_hiera_base_plus.pt"
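One way to fetch the checkpoint is from the official SAM 2 release URL (the `download_ckpts.sh` script in the SAM 2 repository uses the same base URL; valid at the time of writing):

```shell
# Download the SAM 2.1 base-plus checkpoint into the expected folder.
mkdir -p sam2_checkpoints
wget -P sam2_checkpoints \
  https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_base_plus.pt
```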
When using the VIRTUE model family, model.py loads the trained SAM checkpoints automatically.
Under the codes folder:
bash scripts/train_local.sh
For detailed hyper-parameters, please refer to the yaml files under configs/.
Under the codes folder:
bash scripts/eval.sh
Change dataset_config in virtue_eval.yaml to eval_scar.yaml or eval_mmeb.yaml for SCaR or MMEB evaluation, respectively.
Under the codes folder:
python3 demo.py
Please change MSCOCO_IMAGES_DIR to the corresponding image folder.
convert_hf_model.py converts the trained PyTorch model to the HuggingFace format based on the model path in virtue_eval.yaml.
python3 convert_hf_model.py
You may need to copy some configuration files to the HF folder.
load_hf_model.py provides a quick example to verify that the converted model works.
python3 HF_model_conversion/load_hf_model.py
- According to the authors of VLM2Vec_v2, the training codebase does not naturally support single-GPU runs, which causes errors in GradCache.
For any questions or issues, please feel free to open an issue/PR or reach out to Wei-Yao Wang.
If you find this repository relevant or useful to your research, please consider citing our paper:
@article{wangICLR2026virtue,
author = {Wei-Yao Wang and
Kazuya Tateishi and
Qiyu Wu and
Shusuke Takahashi and
Yuki Mitsufuji},
title = {VIRTUE: Visual-Interactive Text-Image Universal Embedder},
journal = {arXiv preprint arXiv:2510.00523},
year = {2025}
}
