Dual-Reward Reinforcement Learning for Open-Vocabulary Intention-Guided Object Detection in Diverse Scenes

Anonymous Authors
ECCV 2026 Submission

Abstract

Accurately identifying and localizing objects that satisfy implicit user intentions remains a fundamental challenge for intelligent assistive systems. We present TriIntentBench, a large-scale benchmark for open-vocabulary intention-guided object detection with free-form, implicit intention descriptions and instance-level bounding-box annotations spanning egocentric, indoor third-person, and outdoor scenarios. We further propose Intent-R1, a two-stage post-training framework for pretrained vision-language models: supervised fine-tuning for task alignment, followed by dual-reward GRPO post-training that focuses updates on informative samples and jointly optimizes localization accuracy and intention-region consistency. Extensive experiments show robust generalization across all three scenario types, achieving mIoU of 64.12, 56.35, and 65.87 on egocentric, indoor, and outdoor subsets, respectively.

Overview

Overview of our Open-Vocabulary Intention-Guided Object Detection (OV-IGOD) task. Given an image and a free-form intention description (e.g., "I'm about to leave the house and I need to make sure I won't lose track of time while I'm out"), our model localizes the object instance that best satisfies the user's intention (e.g., a wristwatch), even when the target object is not explicitly specified in the intention description.

Task overview of OV-IGOD

Key Contributions

  • We introduce TriIntentBench, a large-scale benchmark for open-vocabulary intention-guided object detection with free-form, implicit intention descriptions and instance-level bounding-box annotations, spanning egocentric, indoor third-person, and outdoor scenarios.
  • We propose Intent-R1, a two-stage training pipeline for open-vocabulary intention-guided object detection: supervised fine-tuning for task alignment, followed by dual-reward GRPO post-training with an update scheme that focuses RL updates on informative samples.
  • Extensive experiments on TriIntentBench demonstrate that our approach generalizes robustly across egocentric and third-person indoor/outdoor scenarios, consistently outperforming strong open-vocabulary detector/segmenter baselines and VLM baselines.

Method: Intent-R1

Overview of our two-stage training pipeline. Given (x_img, x_int), we first perform SFT to adapt Rex-Omni to OV-IGOD, and then apply Dual-Reward GRPO post-training with an IoU-based geometry reward and an FG-CLIP intention-alignment reward, regularized by KL divergence to a frozen reference policy.
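
The dual-reward GRPO stage described above can be sketched in standard GRPO form. This is a reconstruction from the description, not the paper's exact objective: the group-relative advantage normalization, clipping ratio ε, reward weight λ, and KL coefficient β are assumptions.

```latex
% Group-relative advantages over G sampled responses per (x_img, x_int)
\hat{A}_i = \frac{R_i - \operatorname{mean}(\{R_j\}_{j=1}^{G})}{\operatorname{std}(\{R_j\}_{j=1}^{G})},
\qquad
R_i = \lambda\, R_{\mathrm{geo}}(b_i, b^{*}) + (1-\lambda)\, R_{\mathrm{align}}(x_{\mathrm{int}}, b_i)

% Clipped policy objective, regularized by KL to the frozen reference policy
\mathcal{J}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}
\min\!\big(r_i(\theta)\,\hat{A}_i,\ \operatorname{clip}(r_i(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_i\big)\right]
- \beta\, \mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right]
```

Here R_geo is the IoU-based geometry reward on the predicted box b_i against the ground-truth box b*, and R_align is the FG-CLIP text-region similarity between the intention description and the predicted region.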

Intent-R1 two-stage training overview
  • We propose a two-stage training pipeline that performs SFT for task alignment, followed by dual-reward GRPO post-training with an update scheme that focuses RL updates on informative samples.
  • In the GRPO stage, we further introduce an intention-alignment reward based on text-region similarity, and combine it with an IoU-derived geometry reward to form a dual-reward objective.
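
The dual-reward computation in the bullets above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the equal default weights and the pluggable `align_score` (standing in for the FG-CLIP text-region similarity) are assumptions.

```python
def iou_reward(pred_box, gt_box):
    """Geometry reward: IoU between two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(pred_box[0], gt_box[0]), max(pred_box[1], gt_box[1])
    ix2, iy2 = min(pred_box[2], gt_box[2]), min(pred_box[3], gt_box[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred_box[2] - pred_box[0]) * (pred_box[3] - pred_box[1])
    area_g = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    union = area_p + area_g - inter
    return inter / union if union > 0 else 0.0

def dual_reward(pred_box, gt_box, align_score, w_geo=0.5, w_align=0.5):
    """Combine geometry and intention-alignment rewards.

    align_score stands in for an FG-CLIP text-region similarity in [0, 1];
    the weights are hypothetical, not values from the paper."""
    return w_geo * iou_reward(pred_box, gt_box) + w_align * align_score

def group_advantages(rewards):
    """GRPO-style group-relative advantages over G sampled responses."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]
```

Samples whose rollout group yields near-zero reward variance carry little learning signal, which is one natural way to realize the "focus updates on informative samples" scheme mentioned above (again, an assumption about the mechanism).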

Quantitative Results

Quantitative comparison on TriIntentBench across the Egocentric, 3rd-Person Indoor, and 3rd-Person Outdoor subsets. We report mIoU and AP (%) at IoU thresholds 0.50 and 0.75, as well as AP averaged over thresholds 0.50:0.95.

| Method | Ego mIoU | Ego AP@50 | Ego AP@75 | Ego AP@50:95 | Indoor mIoU | Indoor AP@50 | Indoor AP@75 | Indoor AP@50:95 | Outdoor mIoU | Outdoor AP@50 | Outdoor AP@75 | Outdoor AP@50:95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Open-vocabulary object detectors / segmenters | | | | | | | | | | | | |
| YOLO-World | 11.18 | 9.07 | 7.45 | 7.07 | 12.20 | 9.55 | 6.26 | 6.48 | 13.50 | 11.20 | 8.50 | 9.30 |
| GroundingDINO-T | 12.95 | 10.30 | 8.20 | 7.88 | 10.71 | 8.53 | 5.89 | 5.90 | 14.72 | 12.16 | 9.60 | 9.36 |
| GroundingDINO-B | 13.08 | 10.54 | 8.54 | 8.15 | 12.84 | 10.28 | 7.44 | 7.46 | 14.60 | 12.25 | 9.68 | 9.41 |
| SAM3 | 16.38 | 13.25 | 10.52 | 10.28 | 11.93 | 9.34 | 5.93 | 5.97 | 12.20 | 10.80 | 7.90 | 7.60 |
| Open-source VLMs | | | | | | | | | | | | |
| Qwen3-VL-2B | 29.98 | 30.74 | 26.34 | 25.01 | 8.67 | 8.33 | 4.07 | 4.60 | 34.12 | 35.11 | 28.84 | 27.36 |
| InternVL3.5-2B | 22.45 | 22.02 | 10.83 | 11.94 | 23.87 | 23.43 | 14.65 | 14.76 | 35.90 | 37.75 | 29.86 | 28.15 |
| Qwen2.5-VL-3B | 15.95 | 13.41 | 8.97 | 9.12 | 21.27 | 18.98 | 8.13 | 9.44 | 40.34 | 42.15 | 34.78 | 33.14 |
| Qwen3-VL-4B | 51.25 | 53.36 | 48.27 | 45.66 | 37.67 | 39.77 | 30.54 | 28.74 | 46.10 | 49.02 | 41.01 | 38.68 |
| InternVL3.5-4B | 35.73 | 36.84 | 28.81 | 27.79 | 28.36 | 26.34 | 18.82 | 19.41 | 42.10 | 44.60 | 35.20 | 33.50 |
| Qwen2.5-VL-7B | 18.37 | 15.92 | 11.67 | 11.57 | 22.48 | 20.92 | 7.75 | 9.85 | 42.97 | 46.34 | 37.06 | 35.19 |
| Qwen3-VL-8B | 51.93 | 54.57 | 48.47 | 46.02 | 38.69 | 41.32 | 31.25 | 29.55 | 48.69 | 52.60 | 43.21 | 40.63 |
| InternVL3.5-8B | 38.58 | 40.90 | 31.51 | 30.02 | 30.60 | 31.31 | 22.85 | 21.78 | 47.57 | 51.10 | 42.55 | 40.20 |
| MiniCPM-V-4.5-8B | 19.34 | 16.87 | 12.59 | 12.48 | 23.67 | 21.44 | 9.13 | 10.28 | 41.72 | 43.56 | 35.29 | 34.15 |
| GLM-4.1V-9B | 52.46 | 55.73 | 48.47 | 45.60 | 41.56 | 44.99 | 33.38 | 31.90 | 47.78 | 51.38 | 39.83 | 38.23 |
| Closed-source VLM | | | | | | | | | | | | |
| GPT-5.2 | 37.71 | 39.25 | 30.56 | 29.63 | 30.21 | 30.73 | 21.31 | 20.78 | 45.79 | 49.53 | 40.91 | 39.46 |
| Intent-R1 (Ours) | 64.12 | 68.44 | 60.13 | 56.95 | 56.35 | 60.87 | 46.36 | 45.18 | 65.87 | 74.12 | 57.89 | 54.78 |

Intent-R1 achieves the best score on every reported metric across all three subsets, with clear gains over strong open-vocabulary detector/segmenter and VLM baselines.
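
The metrics reported above can be computed from per-image IoUs roughly as follows. This is a hedged sketch: it assumes a single target instance per image, so AP at threshold t reduces to the fraction of predictions with IoU ≥ t; that protocol detail is an assumption, not stated in the paper.

```python
import numpy as np

def detection_metrics(best_ious):
    """Compute mIoU, AP@50, AP@75, and AP@50:95 from per-image best IoUs.

    Single-target assumption: AP at a threshold is the fraction of images
    whose predicted box reaches that IoU with the ground-truth box."""
    ious = np.asarray(best_ious, dtype=float)
    miou = float(ious.mean())
    ap50 = float((ious >= 0.50).mean())
    ap75 = float((ious >= 0.75).mean())
    # COCO-style threshold sweep 0.50:0.05:0.95 (rounded to dodge float drift)
    thresholds = np.round(np.arange(0.50, 1.00, 0.05), 2)
    ap5095 = float(np.mean([(ious >= t).mean() for t in thresholds]))
    return miou, ap50, ap75, ap5095
```

For example, IoUs of [0.6, 0.8, 0.4] give mIoU 0.60 with AP@50 of 2/3 and AP@75 of 1/3.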

Qualitative Results

Qualitative comparison between ground truth, strong VLM baselines, and Intent-R1. Green boxes indicate ground-truth targets; red boxes indicate model predictions.

Qualitative success cases

Ablation Study

Each component of Intent-R1 contributes to the overall performance.

| Variant | SFT | GRPO | Dual reward | Data-efficient filter | Ego mIoU | Indoor mIoU | Outdoor mIoU | Avg. mIoU |
|---|---|---|---|---|---|---|---|---|
| (i) | | | | | 57.04 | 48.80 | 60.01 | 55.28 |
| (ii) | | | | | 46.02 | 38.14 | 51.20 | 45.12 |
| (iii) | | | | | 59.97 | 53.25 | 61.70 | 58.31 |
| (iv) | | | | | 61.33 | 54.74 | 63.29 | 59.79 |
| (v) | | | | | 64.12 | 56.35 | 65.87 | 62.11 |

BibTeX

Coming Soon...