ALinrunrun/VLTest

VLTest

VLTest is a black-box testing framework for vision–language models that systematically generates semantics-preserving test cases. It explores discrete visual latent spaces and bounded textual neighborhoods to uncover behavioral inconsistencies across VLM tasks, datasets, and architectures.

License

Overview

Folder Structure

├── README.md
└── VLTest
    ├── beit3
    │   └── get_started
    ├── configs
    ├── data
    │   ├── coco
    │   │   ├── annotations
    │   │   ├── features
    │   │   ├── train
    │   │   └── val
    │   ├── nlvr2
    │   │   ├── annotations
    │   │   ├── features
    │   │   ├── ref_val_cache
    │   │   ├── train
    │   │   │   ├── images_left
    │   │   │   └── images_right
    │   │   └── val
    │   │       ├── images_left
    │   │       └── images_right
    │   └── vqav2
    │       ├── annotations
    │       ├── features
    │       ├── train
    │       └── val
    ├── environment.yml
    ├── humaneval
    └── taming
        ├── data
        │   └── conditional_builder
        ├── models
        └── modules

Dataset and Models

We evaluate VLTest on three representative vision–language tasks: image–text matching, visual reasoning, and visual question answering. All models under test are publicly available. VQGAN is used solely as a visual generator to support discrete latent-space mutation and is not treated as a model under test.

| Task | Dataset | Models |
| --- | --- | --- |
| Image–Text Matching | MSCOCO | ViLT_COCO & BLIP |
| Visual Reasoning | NLVR2 | PaliGemma & BLIP-2 |
| Visual Question Answering | VQA v2 | ViLT_NLVR2 & BEiT-3 |

Resource download

Models and configs/ckpt:

Results: You can find them in the VLTest/outputs directory.

Logs: You can download them via this Link.

Experiments

The usage instructions below follow the order of the research questions (RQs) in the paper.

Environment Configuration

  1. Clone this repository (if you haven't already):
git clone https://github.com/ALinrunrun/VLTest.git
cd VLTest
  2. Create a Conda environment from the environment.yml file:
conda env create -f environment.yml
conda activate taming_ada

Alternatively, create the environment under a name of your choice:

conda env create -f environment.yml -n your_custom_env_name
conda activate your_custom_env_name

Data Preprocessing

Each dataset folder contains a preprocessing script for the corresponding dataset. Follow the script name and the instructions inside each script to preprocess the data. The data source links can be found in Dataset and Models.

  • get_image_captions.py extracts captions corresponding to the target images from the dataset-provided JSON files.
  • download_coco_images.py and collect_orig_images.py are used for VQA v2 and NLVR2, respectively, to retrieve the original images from the dataset JSON files.
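
As an illustration of the caption-extraction step, here is a minimal sketch that assumes the annotation file follows the standard COCO captions layout (an "images" list and an "annotations" list linked by image id). The function name captions_for_images and its arguments are hypothetical; this is not the code of get_image_captions.py, which may work differently.

```python
import json
from collections import defaultdict


def captions_for_images(annotation_file, target_file_names):
    """Map each target image file name to its list of captions.

    Assumes a COCO-captions-style JSON file (hypothetical helper,
    not the repository's get_image_captions.py).
    """
    with open(annotation_file, encoding="utf-8") as f:
        data = json.load(f)

    # COCO caption files list images and annotations separately,
    # linked through the image id.
    id_to_name = {
        img["id"]: img["file_name"]
        for img in data["images"]
        if img["file_name"] in target_file_names
    }

    captions = defaultdict(list)
    for ann in data["annotations"]:
        name = id_to_name.get(ann["image_id"])
        if name is not None:
            captions[name].append(ann["caption"].strip())
    return dict(captions)
```

Passing the target file names as a set keeps the membership check fast when the annotation file covers many images.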

RQ1: Exploration Capability Analysis

Run python run_vltest.py to perturb images. Baseline perturbation follows the same procedure (please run python run_vltest.py).

In RQ1, we obtain the experimental results by adjusting the parameters in these two files. You can quickly reproduce the results by modifying the same parameters.

For example:

import os

# Adjacent string literals are joined into one shell command,
# so every fragment except the last ends with a space.
os.system(
    "CUDA_VISIBLE_DEVICES=1 python perturber.py "  # run on GPU 1
    "--output_path=outputs "                       # output directory
    "--task=itm "                                  # task: Image–Text Matching
    "--perturb_mode=joint "                        # joint image–text perturbation
    "--max_samples=100 "                           # number of seed samples
    "--pert_budget=100 "                           # total perturbation budget for each modality (image and text)
    "--pict_top_p_ratio=0.1 "                      # top-p ratio for image codebook candidates
    "--pict_pert_time=40 "                         # max image perturbation attempts
    "--pict_pert_mode=uniform "                    # uniform image perturbation
    "--text_variant_num=5 "                        # text variants per seed
    "--text_max_word=5 "                           # max words changed per variant
    "--text_max_word_attempts=5 "                  # max attempts per word
    "--seed 123456"                                # random seed for reproducibility
)

The baseline is launched the same way through perturber_baseline.py:

import os

# Adjacent string literals are joined into one shell command,
# so every fragment except the last ends with a space.
os.system(
    "CUDA_VISIBLE_DEVICES=0 python perturber_baseline.py "  # run baseline on GPU 0
    "--output_path=outputs "                                # output directory
    "--task=itm "                                           # task: Image–Text Matching
    "--seed_nums=35 "                                       # number of seeds to use
    "--pre_seed_num=10 "                                    # number of seeds actually perturbed (remaining are reference seeds)
    "--pre_level_start=0 "                                  # initial perturbation level
    "--pre_level_step=0.0005 "                              # perturbation level increment
    "--pre_time=400 "                                       # total baseline perturbation steps
    "--seed 123456"                                         # random seed for reproducibility
)
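
When reproducing results across several parameter settings, it can help to build the command from a dictionary instead of editing the string by hand. The sketch below is one possible way to do this. The helper build_command and the swept values 0.05 and 0.2 are hypothetical, and the --name=value form assumes the scripts parse their arguments with argparse, which accepts both --seed 123456 and --seed=123456.

```python
import shlex


def build_command(script, params, gpu=0):
    """Return a shell command string for one perturbation run (hypothetical helper)."""
    flags = " ".join(
        f"--{name}={shlex.quote(str(value))}" for name, value in params.items()
    )
    return f"CUDA_VISIBLE_DEVICES={gpu} python {script} {flags}"


# Shared settings, taken from the ITM example command.
base = {
    "output_path": "outputs",
    "task": "itm",
    "perturb_mode": "joint",
    "max_samples": 100,
    "pert_budget": 100,
    "seed": 123456,
}

# Sweep one parameter while holding the others fixed.
# The ratios other than 0.1 are illustrative values, not settings from the paper.
commands = [
    build_command("perturber.py", {**base, "pict_top_p_ratio": ratio}, gpu=1)
    for ratio in (0.05, 0.1, 0.2)
]
for cmd in commands:
    print(cmd)  # pass cmd to os.system(cmd) to launch the run
```

Printing the commands first makes it easy to inspect each configuration before committing GPU time to it.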

RQ2: Semantic Validity (Human Evaluation)

Step 1 — Install dependencies

pip install -r requirements.txt

Step 2 — Run the evaluation tool

python eval.py

Step 3 — Enter your username when prompted.

A file `<username>_results.json` will be created automatically to store your scores.

Step 4 — Rate all samples

Check both panels, then rate each image-text pair (1–5) for:

* Image Semantic Preservation (Img-SemPres)
* Text Semantic Preservation (Txt-SemPres)
* Image–Text Alignment (ImgTxt-Align)

Use “Save and Next” to move through samples.

For more details refer to Here
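
Once ratings are collected, the three criteria can be summarized per rater. The sketch below averages each criterion over a list of per-sample ratings. It assumes every rated sample can be reduced to a dict keyed by the three abbreviations above; the actual layout of `<username>_results.json` is not documented here and may differ, so the names CRITERIA and mean_scores are purely illustrative.

```python
from statistics import mean

# Keys follow the abbreviations used in the rating instructions (assumed, not the tool's schema).
CRITERIA = ("Img-SemPres", "Txt-SemPres", "ImgTxt-Align")


def mean_scores(ratings):
    """Average each criterion over a list of per-sample rating dicts."""
    for rating in ratings:
        for name in CRITERIA:
            # Scores are given on a 1-5 scale.
            if not 1 <= rating[name] <= 5:
                raise ValueError(f"{name} score out of range: {rating[name]}")
    return {name: mean(r[name] for r in ratings) for name in CRITERIA}


ratings = [
    {"Img-SemPres": 5, "Txt-SemPres": 4, "ImgTxt-Align": 3},
    {"Img-SemPres": 3, "Txt-SemPres": 4, "ImgTxt-Align": 5},
]
summary = mean_scores(ratings)
```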

RQ3: Failure Discovery Effectiveness and Efficiency

Here, we compute metrics and run differential testing on the perturbed images, relying mainly on two files: metrics.py and diff_test.py. We use the ITM task as an example to demonstrate how to use them.

For metrics.py

  • Before computing the metrics, run compute_bins_dino.py in each dataset directory to obtain the model’s distribution, which is required for subsequent metric computation.
  • Make sure all generated files are saved under ./outputs.
  • Set MODEL in the metrics file according to the current task.
  • Set RESULT_JSON_PATH to the name of the output JSON file.
MODEL = "image"
RESULT_JSON_PATH = "summary_metrics.json"
OUTPUT = "./outputs"

For diff_test.py

  • We continue to use automatic folder iteration: all contents under TEST_ROOT will be evaluated sequentially. Please ensure the directory structure is correct.
  • PAIR_BUDGET and TEXT_POOL_BUDGET correspond to the settings in RQ3 and should be adjusted accordingly.
PAIR_BUDGET = 1000
TEXT_POOL_BUDGET = 100
TEST_ROOT = "diff0"
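
To make the folder iteration and the budget concrete, here is a minimal sketch under stated assumptions. It treats differential testing in the general sense of flagging test cases on which two models disagree, and it interprets the pair budget as a cap on how many cases are checked. The helpers iter_test_dirs and count_disagreements are hypothetical and do not reproduce the logic of diff_test.py.

```python
from pathlib import Path

PAIR_BUDGET = 1000
TEST_ROOT = "diff0"


def iter_test_dirs(root):
    """Yield the sub-directories of root in sorted order (hypothetical helper)."""
    root = Path(root)
    if not root.is_dir():
        return  # nothing to iterate if the root is missing
    for entry in sorted(root.iterdir()):
        if entry.is_dir():
            yield entry


def count_disagreements(prediction_pairs, budget):
    """Count cases where two models disagree, checking at most `budget` cases."""
    checked = failures = 0
    for pred_a, pred_b in prediction_pairs:
        if checked >= budget:
            break
        checked += 1
        if pred_a != pred_b:
            failures += 1
    return checked, failures


for test_dir in iter_test_dirs(TEST_ROOT):
    print(test_dir.name)  # each folder would be evaluated in turn
```

Sorting the directory entries keeps the evaluation order stable across runs, which makes results easier to compare.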
