Dual-Reward Reinforcement Learning for Open-Vocabulary Intention-Guided Object Detection in Diverse Scenes

Anonymous Authors
ECCV 2026 Submission

Abstract

Accurately identifying and localizing objects that satisfy implicit user intentions remains a fundamental challenge for intelligent assistive systems. We present TriIntentBench, a large-scale benchmark for open-vocabulary intention-guided object detection with free-form, implicit intention descriptions and instance-level bounding-box annotations spanning egocentric, indoor third-person, and outdoor scenarios. We further propose Intent-R1, a two-stage post-training framework for pretrained vision-language models: supervised fine-tuning for task alignment, followed by dual-reward GRPO post-training that focuses updates on informative samples and jointly optimizes localization accuracy and intention-region consistency. Extensive experiments show robust generalization across all three scenario types, achieving mIoU of 64.12, 56.35, and 65.87 on egocentric, indoor, and outdoor subsets, respectively.

Overview

Overview of our Open-Vocabulary Intention-Guided Object Detection (OV-IGOD) task. Given an image and a free-form intention description (e.g., "I'm about to leave the house and I need to make sure I won't lose track of time while I'm out"), our model localizes the object instance that best satisfies the user's intention (e.g., a wristwatch), even when the target object is not explicitly specified in the intention description.

Task overview of OV-IGOD

Key Contributions

  • We introduce TriIntentBench, a large-scale benchmark for open-vocabulary intention-guided object detection with free-form, implicit intention descriptions and instance-level bounding-box annotations, spanning egocentric, indoor third-person, and outdoor scenarios.
  • We propose Intent-R1, a two-stage training pipeline for open-vocabulary intention-guided object detection: supervised fine-tuning for task alignment, followed by dual-reward GRPO post-training with an update scheme that focuses RL updates on informative samples.
  • Extensive experiments on TriIntentBench demonstrate that our approach generalizes robustly across egocentric and third-person indoor/outdoor scenarios, consistently outperforming strong open-vocabulary detector/segmenter baselines and VLM baselines.

Method: Intent-R1

Overview of our two-stage training pipeline. Given (x_img, x_int), we first perform SFT to adapt Rex-Omni to OV-IGOD, and then apply Dual-Reward GRPO post-training with an IoU-based geometry reward and an FG-CLIP intention-alignment reward, regularized by KL divergence to a frozen reference policy.
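
The dual-reward GRPO stage described above can be sketched in standard GRPO form. This is a reconstruction from the description, not the paper's exact objective: the group-relative advantage normalization, clipping ratio ε, reward weight λ, and KL coefficient β are assumptions.

```latex
% Group-relative advantages over G sampled responses per (x_img, x_int)
\hat{A}_i = \frac{R_i - \operatorname{mean}(\{R_j\}_{j=1}^{G})}{\operatorname{std}(\{R_j\}_{j=1}^{G})},
\qquad
R_i = \lambda\, R_{\mathrm{geo}}(b_i, b^{*}) + (1-\lambda)\, R_{\mathrm{align}}(x_{\mathrm{int}}, b_i)

% Clipped policy objective, regularized by KL to the frozen reference policy
\mathcal{J}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}
\min\!\big(r_i(\theta)\,\hat{A}_i,\ \operatorname{clip}(r_i(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_i\big)\right]
- \beta\, \mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right]
```

Here R_geo is the IoU-based geometry reward on the predicted box b_i against the ground-truth box b*, and R_align is the FG-CLIP text-region similarity between the intention description and the predicted region.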

Intent-R1 two-stage training overview
  • We propose a two-stage training pipeline that performs SFT for task alignment, followed by dual-reward GRPO post-training with an update scheme that focuses RL updates on informative samples.
  • In the GRPO stage, we further introduce an intention-alignment reward based on text-region similarity, and combine it with an IoU-derived geometry reward to form a dual-reward objective.
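
The dual-reward computation in the bullets above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the equal default weights and the pluggable `align_score` (standing in for the FG-CLIP text-region similarity) are assumptions.

```python
def iou_reward(pred_box, gt_box):
    """Geometry reward: IoU between two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(pred_box[0], gt_box[0]), max(pred_box[1], gt_box[1])
    ix2, iy2 = min(pred_box[2], gt_box[2]), min(pred_box[3], gt_box[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred_box[2] - pred_box[0]) * (pred_box[3] - pred_box[1])
    area_g = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    union = area_p + area_g - inter
    return inter / union if union > 0 else 0.0

def dual_reward(pred_box, gt_box, align_score, w_geo=0.5, w_align=0.5):
    """Combine geometry and intention-alignment rewards.

    align_score stands in for an FG-CLIP text-region similarity in [0, 1];
    the weights are hypothetical, not values from the paper."""
    return w_geo * iou_reward(pred_box, gt_box) + w_align * align_score

def group_advantages(rewards):
    """GRPO-style group-relative advantages over G sampled responses."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]
```

Samples whose rollout group yields near-zero reward variance carry little learning signal, which is one natural way to realize the "focus updates on informative samples" scheme mentioned above (again, an assumption about the mechanism).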

Quantitative Results

Quantitative comparison on TriIntentBench across the Egocentric, 3rd-Person Indoor, and 3rd-Person Outdoor subsets. We report mIoU and AP (%) at IoU thresholds 0.50 and 0.75, as well as AP averaged over thresholds 0.50:0.95.

| Method | Ego mIoU | Ego AP@50 | Ego AP@75 | Ego AP@50:95 | Indoor mIoU | Indoor AP@50 | Indoor AP@75 | Indoor AP@50:95 | Outdoor mIoU | Outdoor AP@50 | Outdoor AP@75 | Outdoor AP@50:95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Open-vocabulary object detectors / segmenters | | | | | | | | | | | | |
| YOLO-World | 11.18 | 9.07 | 7.45 | 7.07 | 12.20 | 9.55 | 6.26 | 6.48 | 13.50 | 11.20 | 8.50 | 9.30 |
| GroundingDINO-T | 12.95 | 10.30 | 8.20 | 7.88 | 10.71 | 8.53 | 5.89 | 5.90 | 14.72 | 12.16 | 9.60 | 9.36 |
| GroundingDINO-B | 13.08 | 10.54 | 8.54 | 8.15 | 12.84 | 10.28 | 7.44 | 7.46 | 14.60 | 12.25 | 9.68 | 9.41 |
| SAM3 | 16.38 | 13.25 | 10.52 | 10.28 | 11.93 | 9.34 | 5.93 | 5.97 | 12.20 | 10.80 | 7.90 | 7.60 |
| Open-source VLMs | | | | | | | | | | | | |
| Qwen3-VL-2B | 29.98 | 30.74 | 26.34 | 25.01 | 8.67 | 8.33 | 4.07 | 4.60 | 34.12 | 35.11 | 28.84 | 27.36 |
| InternVL3.5-2B | 22.45 | 22.02 | 10.83 | 11.94 | 23.87 | 23.43 | 14.65 | 14.76 | 35.90 | 37.75 | 29.86 | 28.15 |
| Qwen2.5-VL-3B | 15.95 | 13.41 | 8.97 | 9.12 | 21.27 | 18.98 | 8.13 | 9.44 | 40.34 | 42.15 | 34.78 | 33.14 |
| Qwen3-VL-4B | 51.25 | 53.36 | 48.27 | 45.66 | 37.67 | 39.77 | 30.54 | 28.74 | 46.10 | 49.02 | 41.01 | 38.68 |
| InternVL3.5-4B | 35.73 | 36.84 | 28.81 | 27.79 | 28.36 | 26.34 | 18.82 | 19.41 | 42.10 | 44.60 | 35.20 | 33.50 |
| Qwen2.5-VL-7B | 18.37 | 15.92 | 11.67 | 11.57 | 22.48 | 20.92 | 7.75 | 9.85 | 42.97 | 46.34 | 37.06 | 35.19 |
| Qwen3-VL-8B | 51.93 | 54.57 | 48.47 | 46.02 | 38.69 | 41.32 | 31.25 | 29.55 | 48.69 | 52.60 | 43.21 | 40.63 |
| InternVL3.5-8B | 38.58 | 40.90 | 31.51 | 30.02 | 30.60 | 31.31 | 22.85 | 21.78 | 47.57 | 51.10 | 42.55 | 40.20 |
| MiniCPM-V-4.5-8B | 19.34 | 16.87 | 12.59 | 12.48 | 23.67 | 21.44 | 9.13 | 10.28 | 41.72 | 43.56 | 35.29 | 34.15 |
| GLM-4.1V-9B | 52.46 | 55.73 | 48.47 | 45.60 | 41.56 | 44.99 | 33.38 | 31.90 | 47.78 | 51.38 | 39.83 | 38.23 |
| Closed-source VLM | | | | | | | | | | | | |
| GPT-5.2 | 37.71 | 39.25 | 30.56 | 29.63 | 30.21 | 30.73 | 21.31 | 20.78 | 45.79 | 49.53 | 40.91 | 39.46 |
| Intent-R1 (Ours) | 64.12 | 68.44 | 60.13 | 56.95 | 56.35 | 60.87 | 46.36 | 45.18 | 65.87 | 74.12 | 57.89 | 54.78 |

Intent-R1 achieves the best score on every reported metric across all three subsets, with clear gains over strong open-vocabulary detector/segmenter and VLM baselines.
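
The metrics reported above can be computed from per-image IoUs roughly as follows. This is a hedged sketch: it assumes a single target instance per image, so AP at threshold t reduces to the fraction of predictions with IoU ≥ t; that protocol detail is an assumption, not stated in the paper.

```python
import numpy as np

def detection_metrics(best_ious):
    """Compute mIoU, AP@50, AP@75, and AP@50:95 from per-image best IoUs.

    Single-target assumption: AP at a threshold is the fraction of images
    whose predicted box reaches that IoU with the ground-truth box."""
    ious = np.asarray(best_ious, dtype=float)
    miou = float(ious.mean())
    ap50 = float((ious >= 0.50).mean())
    ap75 = float((ious >= 0.75).mean())
    # COCO-style threshold sweep 0.50:0.05:0.95 (rounded to dodge float drift)
    thresholds = np.round(np.arange(0.50, 1.00, 0.05), 2)
    ap5095 = float(np.mean([(ious >= t).mean() for t in thresholds]))
    return miou, ap50, ap75, ap5095
```

For example, IoUs of [0.6, 0.8, 0.4] give mIoU 0.60 with AP@50 of 2/3 and AP@75 of 1/3.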

Qualitative Results

Qualitative comparison between ground truth, strong VLM baselines, and Intent-R1. Green boxes indicate ground-truth targets; red boxes indicate model predictions.

Qualitative success cases

Ablation Study

Each component of Intent-R1 contributes to the overall performance.

| Variant | SFT | GRPO | Dual reward | Data-efficient filter | Ego mIoU | Indoor mIoU | Outdoor mIoU | Avg. mIoU |
|---|---|---|---|---|---|---|---|---|
| (i) | | | | | 57.04 | 48.80 | 60.01 | 55.28 |
| (ii) | | | | | 46.02 | 38.14 | 51.20 | 45.12 |
| (iii) | | | | | 59.97 | 53.25 | 61.70 | 58.31 |
| (iv) | | | | | 61.33 | 54.74 | 63.29 | 59.79 |
| (v) | | | | | 64.12 | 56.35 | 65.87 | 62.11 |

BibTeX

Coming Soon...