VLTest is a black-box testing framework for vision–language models that systematically generates semantics-preserving test cases. It explores discrete visual latent spaces and bounded textual neighborhoods to uncover behavioral inconsistencies across VLM tasks, datasets, and architectures.
├── README.md
├── VLTest
├── beit3
│   └── get_started
├── configs
├── data
│   ├── coco
│   │   ├── annotations
│   │   ├── features
│   │   ├── train
│   │   └── val
│   ├── nlvr2
│   │   ├── annotations
│   │   ├── features
│   │   ├── ref_val_cache
│   │   ├── train
│   │   │   ├── images_left
│   │   │   └── images_right
│   │   └── val
│   │       ├── images_left
│   │       └── images_right
│   └── vqav2
│       ├── annotations
│       ├── features
│       ├── train
│       └── val
├── environment.yml
├── humaneval
└── taming
    ├── data
    │   └── conditional_builder
    ├── models
    └── modules
We evaluate VLTest on three representative vision–language tasks, covering image–text matching, visual reasoning, and visual question answering. All models under test are publicly available. VQGAN is used solely as a visual generator to support discrete latent-space mutation and is not itself a model under test.
| Task | Dataset | Models |
|---|---|---|
| Image–Text Matching | MSCOCO | ViLT_COCO & BLIP |
| Visual Reasoning | NLVR2 | ViLT_NLVR2 & BEiT-3 |
| Visual Question Answering | VQA v2 | PaliGemma & BLIP-2 |
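As noted above, VQGAN only provides the discrete visual latent space in which seed images are mutated. Below is a minimal illustration of the index-replacement idea, not the framework's actual implementation: it assumes an image has already been encoded into a grid of codebook indices and that a per-position candidate set has been kept (e.g., via the top-p ratio), so the actual VQGAN `encode`/`decode` calls are left out and all tensors here are toy data.

```python
import torch

def mutate_codes(codes: torch.Tensor,
                 candidate_sets: torch.Tensor,
                 num_positions: int,
                 generator: torch.Generator) -> torch.Tensor:
    """Replace a few codebook indices of an encoded image with candidate indices.

    codes:          (H*W,) long tensor of codebook indices from the VQGAN encoder.
    candidate_sets: (H*W, K) long tensor of K plausible replacement indices per
                    position (e.g., retained by a top-p ratio over codebook logits).
    num_positions:  how many latent positions to mutate for this variant.
    """
    mutated = codes.clone()
    positions = torch.randperm(codes.numel(), generator=generator)[:num_positions]
    for pos in positions:
        # Pick one replacement uniformly from this position's candidate set.
        choice = int(torch.randint(candidate_sets.size(1), (1,), generator=generator))
        mutated[pos] = candidate_sets[pos, choice]
    return mutated

# Toy example: a 16x16 latent grid over a 1024-entry codebook, 8 candidates per position.
g = torch.Generator().manual_seed(123456)
codes = torch.randint(1024, (256,), generator=g)
candidates = torch.randint(1024, (256, 8), generator=g)
variant = mutate_codes(codes, candidates, num_positions=10, generator=g)
# `variant` would then be passed through the VQGAN decoder to obtain the perturbed image.
```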
Models and configs/ckpt:
Results: You can find them in the VLTest/outputs directory.
Logs: You can download them via this Link.
We follow the order of the research questions (RQs) in the paper to explain the usage.
- Clone this repository (if you haven't already):
git clone https://github.com/ALinrunrun/VLTest.git
cd VLTest
- Create a Conda environment from the environment.yml file:
conda env create -f environment.yml
conda activate taming_ada
or
conda env create -f environment.yml -n your_custom_env_name
conda activate your_custom_env_name
Each folder contains a preprocessing script for the corresponding dataset. Please follow the script name and its inner instructions to perform data preprocessing. The data source links can be found in Dataset and Models.
get_image_captions.py extracts captions corresponding to the target images from the dataset-provided JSON files. download_coco_images.py and collect_orig_images.py are used for VQA v2 and NLVR2, respectively, to retrieve the original images from the dataset JSON files.
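For orientation, here is a minimal sketch of the caption-extraction idea behind `get_image_captions.py`, assuming the standard COCO-style captions annotation format (`images` / `annotations` arrays). The file paths and the output layout are illustrative only, not the script's actual interface:

```python
import json
from collections import defaultdict

# Illustrative paths -- adjust to wherever the dataset JSON files were placed.
ANNOTATION_FILE = "data/coco/annotations/captions_val2014.json"
OUTPUT_FILE = "data/coco/annotations/image_captions.json"

with open(ANNOTATION_FILE) as f:
    coco = json.load(f)

# Map image_id -> file_name, then collect every caption written for that image.
id_to_name = {img["id"]: img["file_name"] for img in coco["images"]}
captions = defaultdict(list)
for ann in coco["annotations"]:
    captions[id_to_name[ann["image_id"]]].append(ann["caption"])

with open(OUTPUT_FILE, "w") as f:
    json.dump(captions, f, indent=2)
```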
Run `python run_vltest.py` to perturb images, and run the same command to apply the corresponding baseline perturbation.
In RQ1, the experimental results are obtained by adjusting the parameters in these two files; you can quickly reproduce them by modifying the same parameters. For example:
import os

# Joint image–text perturbation for the ITM task (runs on GPU 1).
os.system(
    "CUDA_VISIBLE_DEVICES=1 python perturber.py"
    " --output_path=outputs"        # output directory
    " --task=itm"                   # task: Image–Text Matching
    " --perturb_mode=joint"         # joint image–text perturbation
    " --max_samples=100"            # number of seed samples
    " --pert_budget=100"            # total perturbation budget per modality (image and text)
    " --pict_top_p_ratio=0.1"       # top-p ratio for image codebook candidates
    " --pict_pert_time=40"          # max image perturbation attempts
    " --pict_pert_mode=uniform"     # uniform image perturbation
    " --text_variant_num=5"         # text variants per seed
    " --text_max_word=5"            # max words changed per variant
    " --text_max_word_attempts=5"   # max attempts per word
    " --seed 123456"                # random seed for reproducibility
)
import os

# Baseline perturbation for the ITM task (runs on GPU 0).
os.system(
    "CUDA_VISIBLE_DEVICES=0 python perturber_baseline.py"
    " --output_path=outputs"     # output directory
    " --task=itm"                # task: Image–Text Matching
    " --seed_nums=35"            # number of seeds to use
    " --pre_seed_num=10"         # number of seeds actually perturbed (the rest serve as reference seeds)
    " --pre_level_start=0"       # initial perturbation level
    " --pre_level_step=0.0005"   # perturbation level increment
    " --pre_time=400"            # total baseline perturbation steps
    " --seed 123456"             # random seed for reproducibility
)
Step 1 — Install dependencies
pip install -r requirements.txt
Step 2 — Run the evaluation tool
python eval.py
Step 3 — Enter your username when prompted.
A file `<username>_results.json` will be created automatically to store your scores.
Step 4 — Rate all samples. Check both panels, then rate each image–text pair (1–5) for:
* Image Semantic Preservation (Img-SemPres)
* Text Semantic Preservation (Txt-SemPres)
* Image–Text Alignment (ImgTxt-Align)
Use “Save and Next” to move through samples.
For more details, refer to Here.
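For a quick look at the collected ratings, a hedged sketch of averaging the three dimensions from `<username>_results.json` is shown below. The per-sample key names (`Img-SemPres`, `Txt-SemPres`, `ImgTxt-Align`) and the overall file layout are assumptions for illustration, not the tool's documented schema:

```python
import json
import sys

# Usage: python summarize_ratings.py <username>_results.json
# NOTE: assumes each entry stores one 1-5 rating per dimension;
# adjust the key names to the actual file layout.
DIMENSIONS = ["Img-SemPres", "Txt-SemPres", "ImgTxt-Align"]

with open(sys.argv[1]) as f:
    results = json.load(f)

entries = results.values() if isinstance(results, dict) else results
for dim in DIMENSIONS:
    scores = [entry[dim] for entry in entries if dim in entry]
    if scores:
        print(f"{dim}: mean {sum(scores) / len(scores):.2f} over {len(scores)} samples")
```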
Here, we compute metric statistics and perform differential testing on the perturbed images, relying mainly on two files: `metrics.py` and `diff_test.py`.
We take the ITM task as an example to demonstrate how to use them.
For `metrics.py`:
- Before computing the metrics, run `compute_bins_dino.py` in each dataset directory to obtain the model's distribution, which is required for subsequent metric computation (a hedged sketch of this step is shown below).
- Make sure all generated files are saved under `./outputs`.
- Set `MODEL` in the metrics file according to the current task.
- Specify the `RESULT_JSON_PATH`, i.e., the name of the output JSON file.

MODEL = "image"
RESULT_JSON_PATH = "summary_metrics.json"
OUTPUT = "./outputs"
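As a rough illustration of the reference-distribution step, the sketch below embeds a directory of dataset images with a DINO backbone and bins a simple similarity statistic. This is only one plausible reading of what `compute_bins_dino.py` produces; the image directory, the statistic, and the number of bins are assumptions, and the real binning is whatever that script defines:

```python
import torch
from PIL import Image
from pathlib import Path
from torchvision import transforms

# Load a DINO ViT-S/16 backbone from torch.hub (downloads weights on first run).
model = torch.hub.load("facebookresearch/dino:main", "dino_vits16")
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

IMAGE_DIR = Path("data/coco/val")  # illustrative dataset directory

embeddings = []
with torch.no_grad():
    for path in sorted(IMAGE_DIR.glob("*.jpg")):
        img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        embeddings.append(model(img).squeeze(0))

features = torch.stack(embeddings)
# One possible "distribution": histogram bins of cosine similarity to the mean
# reference embedding. The actual binning is defined in compute_bins_dino.py.
sims = torch.nn.functional.cosine_similarity(features, features.mean(0, keepdim=True))
print(torch.histc(sims, bins=20, min=-1.0, max=1.0))
```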
For `diff_test.py`:
- We continue to use automatic folder iteration: all contents under `TEST_ROOT` will be evaluated sequentially. Please ensure the directory structure is correct.
- `PAIR_BUDGET` and `TEXT_POOL_BUDGET` correspond to the settings in RQ3 and should be adjusted accordingly.

PAIR_BUDGET = 1000
TEXT_POOL_BUDGET = 100
TEST_ROOT = "diff0"
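To make the differential-testing idea concrete, here is a minimal, self-contained sketch: two models under test are queried on the same perturbed image–text pairs, and disagreements are recorded as candidate behavioral inconsistencies. The `predict_a`/`predict_b` callables and the pair format are placeholders for illustration, not `diff_test.py`'s actual interface:

```python
from typing import Callable, Iterable, List, Tuple

# Placeholder type: a model maps an (image_path, text) pair to a discrete decision,
# e.g. "match"/"no-match" for ITM, True/False for NLVR2, or an answer string for VQA.
Predictor = Callable[[str, str], str]

def differential_test(pairs: Iterable[Tuple[str, str]],
                      predict_a: Predictor,
                      predict_b: Predictor,
                      pair_budget: int = 1000) -> List[dict]:
    """Return the pairs on which the two models disagree, using at most pair_budget pairs."""
    disagreements = []
    for i, (image_path, text) in enumerate(pairs):
        if i >= pair_budget:
            break
        out_a, out_b = predict_a(image_path, text), predict_b(image_path, text)
        if out_a != out_b:
            disagreements.append({"image": image_path, "text": text,
                                  "model_a": out_a, "model_b": out_b})
    return disagreements
```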