Superpowering Open-Vocabulary Object Detectors for X-ray Vision

ICCV 2025

¹University of Santiago de Compostela, Spain · ²Fondazione Bruno Kessler, Italy · ³University of Trento, Italy

Overview

Open-vocabulary object detection (OvOD) is set to revolutionize security screening by enabling systems to recognize any item in X-ray scans. However, developing effective OvOD models for X-ray imaging presents unique challenges due to data scarcity and the modality gap that prevents direct adoption of RGB-based solutions. To overcome these limitations, we propose RAXO, a training-free framework that repurposes off-the-shelf RGB OvOD detectors for robust X-ray detection. RAXO builds high-quality X-ray class descriptors using a dual-source retrieval strategy. It gathers relevant RGB images from the web and enriches them via a novel X-ray material transfer mechanism, eliminating the need for labeled databases. These visual descriptors replace text-based classification in OvOD, leveraging intra-modal feature distances for robust detection. Extensive experiments demonstrate that RAXO consistently improves OvOD performance, providing an average mAP increase of up to 17.0 points over base detectors. To further support research in this emerging field, we also introduce DET-COMPASS, a new benchmark featuring bounding box annotations for over 300 object categories, enabling large-scale evaluation of OvOD in X-ray.

DET-COMPASS Dataset

How Does RAXO Work?

Figure: Overview of the RAXO architecture.

RGB-based open-vocabulary object detectors often fail when applied to X-ray imagery. The core issue lies in the breakdown of alignment between textual and visual features: objects appear significantly different in RGB and X-ray domains, causing text embeddings—originally aligned with RGB features—to misalign with X-ray representations.
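This misalignment can be illustrated with a minimal NumPy sketch. The vectors below are synthetic stand-ins for real embeddings: a text embedding trained to align with RGB features scores high against an RGB feature of the same object, but not against an X-ray feature, whose appearance shift makes it effectively uncorrelated with the text.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
text_emb = rng.normal(size=16)                   # text embedding for a prompt, e.g. "knife"
rgb_feat = text_emb + 0.2 * rng.normal(size=16)  # RGB feature: close to the text embedding
xray_feat = rng.normal(size=16)                  # X-ray feature: appearance shift breaks alignment

print(cosine(text_emb, rgb_feat))   # high: cross-modal alignment holds in RGB
print(cosine(text_emb, xray_feat))  # near zero: alignment breaks in X-ray
```

The same text-to-image scoring that drives RGB open-vocabulary detectors thus becomes unreliable on X-ray features, which motivates replacing it with visual-to-visual comparisons.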

RAXO addresses this challenge with a simple and effective three-step process:

  1. Visual Sample Acquisition: RAXO first gathers visual samples for each user-defined category, combining any available X-ray images with web-retrieved RGB images adapted to the X-ray domain via material transfer.
  2. Class Descriptor Modeling: Using the gathered samples, RAXO builds visual descriptors that capture the defining features of each category within the X-ray modality.
  3. Classifier Construction: These descriptors are then used to create a visual-based classifier, replacing the standard text-to-image classification approach. Instead of relying on cross-modal alignment, RAXO uses intra-modal (visual-to-visual) similarity—an approach that remains robust despite the domain shift.
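The three steps above can be sketched as a minimal visual-prototype classifier. This is a NumPy-only illustration, not the paper's exact formulation: the feature dimension, the mean-pooling aggregation of support features, and the toy random "features" are all assumptions standing in for real detector embeddings.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to unit length so dot products equal cosine similarity.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def build_class_descriptors(support_features):
    """Step 2: model one visual descriptor per class.

    support_features: dict mapping class name -> (n_samples, d) array of
    features extracted from the gathered X-ray samples (step 1).
    Returns (class_names, (n_classes, d) matrix of unit-norm prototypes).
    """
    names = sorted(support_features)
    protos = np.stack([l2_normalize(support_features[c]).mean(axis=0) for c in names])
    return names, l2_normalize(protos)

def classify_regions(region_features, class_names, prototypes):
    """Step 3: intra-modal (visual-to-visual) classification.

    Assigns each detected region the class of its most similar prototype,
    replacing the usual text-embedding classifier.
    """
    sims = l2_normalize(region_features) @ prototypes.T  # cosine similarities
    idx = sims.argmax(axis=1)
    return [class_names[i] for i in idx], sims

# Toy usage with random vectors standing in for detector features.
rng = np.random.default_rng(0)
knife_dir = rng.normal(size=8)
support = {
    "knife": knife_dir + 0.1 * rng.normal(size=(5, 8)),
    "bottle": -knife_dir + 0.1 * rng.normal(size=(5, 8)),
}
names, protos = build_class_descriptors(support)
labels, _ = classify_regions(np.stack([knife_dir]), names, protos)
print(labels)  # prints ['knife']
```

Because both the prototypes and the region features live in the detector's visual feature space, the comparison stays intra-modal and avoids the broken text-to-X-ray alignment.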

Through this pipeline, RAXO enables existing OvOD models to accurately detect objects in X-ray images, leveraging their intrinsic visual characteristics without requiring additional training.

Results

BibTeX

@inproceedings{garcia2025superpowering,
  author    = {Garcia-Fernandez, Pablo and Vaquero, Lorenzo and Liu, Mingxuan and Xue, Feng and Cores, Daniel and Sebe, Nicu and Mucientes, Manuel and Ricci, Elisa},
  title     = {Superpowering Open-Vocabulary Object Detectors for X-ray Vision},
  booktitle = {ICCV},
  year      = {2025},
}