Superpowering Open-Vocabulary Object Detectors for X-ray Vision

ICCV 2025

¹University of Santiago de Compostela, Spain · ²Fondazione Bruno Kessler, Italy · ³University of Trento, Italy

Overview

Open-vocabulary object detection (OvOD) is set to revolutionize security screening by enabling systems to recognize any item in X-ray scans. However, developing effective OvOD models for X-ray imaging presents unique challenges due to data scarcity and the modality gap that prevents direct adoption of RGB-based solutions. To overcome these limitations, we propose RAXO, a training-free framework that repurposes off-the-shelf RGB OvOD detectors for robust X-ray detection. RAXO builds high-quality X-ray class descriptors using a dual-source retrieval strategy. It gathers relevant RGB images from the web and enriches them via a novel X-ray material transfer mechanism, eliminating the need for labeled databases. These visual descriptors replace text-based classification in OvOD, leveraging intra-modal feature distances for robust detection. Extensive experiments demonstrate that RAXO consistently improves OvOD performance, providing an average mAP increase of up to 17.0 points over base detectors. To further support research in this emerging field, we also introduce DET-COMPASS, a new benchmark featuring bounding box annotations for over 300 object categories, enabling large-scale evaluation of OvOD in X-ray.

DET-COMPASS Dataset

How Does RAXO Work?

Figure: Overview of the RAXO architecture.

RGB-based open-vocabulary object detectors often fail when applied to X-ray imagery. The core issue lies in the breakdown of alignment between textual and visual features: objects appear significantly different in RGB and X-ray domains, causing text embeddings—originally aligned with RGB features—to misalign with X-ray representations.
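This misalignment can be illustrated with a minimal NumPy sketch. The vectors below are synthetic stand-ins for real embeddings: a text embedding trained to align with RGB features scores high against an RGB feature of the same object, but not against an X-ray feature, whose appearance shift makes it effectively uncorrelated with the text.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
text_emb = rng.normal(size=16)                   # text embedding for a prompt, e.g. "knife"
rgb_feat = text_emb + 0.2 * rng.normal(size=16)  # RGB feature: close to the text embedding
xray_feat = rng.normal(size=16)                  # X-ray feature: appearance shift breaks alignment

print(cosine(text_emb, rgb_feat))   # high: cross-modal alignment holds in RGB
print(cosine(text_emb, xray_feat))  # near zero: alignment breaks in X-ray
```

The same text-to-image scoring that drives RGB open-vocabulary detectors thus becomes unreliable on X-ray features, which motivates replacing it with visual-to-visual comparisons.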

RAXO addresses this challenge with a simple and effective three-step process:

  1. Visual Sample Acquisition: RAXO first gathers visual samples for each user-defined category, combining any available X-ray images with web-retrieved RGB images adapted to the X-ray domain via material transfer.
  2. Class Descriptor Modeling: Using the gathered samples, RAXO builds visual descriptors that capture the defining features of each category within the X-ray modality.
  3. Classifier Construction: These descriptors are then used to create a visual-based classifier, replacing the standard text-to-image classification approach. Instead of relying on cross-modal alignment, RAXO uses intra-modal (visual-to-visual) similarity—an approach that remains robust despite the domain shift.
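The three steps above can be sketched as a minimal visual-prototype classifier. This is a NumPy-only illustration, not the paper's exact formulation: the feature dimension, the mean-pooling aggregation of support features, and the toy random "features" are all assumptions standing in for real detector embeddings.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to unit length so dot products equal cosine similarity.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def build_class_descriptors(support_features):
    """Step 2: model one visual descriptor per class.

    support_features: dict mapping class name -> (n_samples, d) array of
    features extracted from the gathered X-ray samples (step 1).
    Returns (class_names, (n_classes, d) matrix of unit-norm prototypes).
    """
    names = sorted(support_features)
    protos = np.stack([l2_normalize(support_features[c]).mean(axis=0) for c in names])
    return names, l2_normalize(protos)

def classify_regions(region_features, class_names, prototypes):
    """Step 3: intra-modal (visual-to-visual) classification.

    Assigns each detected region the class of its most similar prototype,
    replacing the usual text-embedding classifier.
    """
    sims = l2_normalize(region_features) @ prototypes.T  # cosine similarities
    idx = sims.argmax(axis=1)
    return [class_names[i] for i in idx], sims

# Toy usage with random vectors standing in for detector features.
rng = np.random.default_rng(0)
knife_dir = rng.normal(size=8)
support = {
    "knife": knife_dir + 0.1 * rng.normal(size=(5, 8)),
    "bottle": -knife_dir + 0.1 * rng.normal(size=(5, 8)),
}
names, protos = build_class_descriptors(support)
labels, _ = classify_regions(np.stack([knife_dir]), names, protos)
print(labels)  # prints ['knife']
```

Because both the prototypes and the region features live in the detector's visual feature space, the comparison stays intra-modal and avoids the broken text-to-X-ray alignment.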

Through this pipeline, RAXO enables existing OvOD models to accurately detect objects in X-ray images, leveraging their intrinsic visual characteristics without requiring additional training.

Results

BibTeX

@inproceedings{garcia2025superpowering,
  author    = {Garcia-Fernandez, Pablo and Vaquero, Lorenzo and Liu, Mingxuan and Xue, Feng and Cores, Daniel and Sebe, Nicu and Mucientes, Manuel and Ricci, Elisa},
  title     = {Superpowering Open-Vocabulary Object Detectors for X-ray Vision},
  booktitle = {ICCV},
  year      = {2025},
}