Accurately identifying and localizing objects that satisfy implicit user intentions remains a fundamental challenge for intelligent assistive systems. We present TriIntentBench, a large-scale benchmark for open-vocabulary intention-guided object detection (OV-IGOD) with free-form, implicit intention descriptions and instance-level bounding-box annotations spanning egocentric, indoor third-person, and outdoor third-person scenarios. We further propose Intent-R1, a two-stage post-training framework for pretrained vision-language models: supervised fine-tuning (SFT) for task alignment, followed by dual-reward GRPO post-training that focuses updates on informative samples and jointly optimizes localization accuracy and intention-region consistency. Extensive experiments show robust generalization across all three scenario types, with Intent-R1 achieving mIoU of 64.12, 56.35, and 65.87 on the egocentric, indoor, and outdoor subsets, respectively.
Overview of our two-stage training pipeline. Given (x_img, x_int), we first perform SFT to adapt Rex-Omni to OV-IGOD, and then apply Dual-Reward GRPO post-training with an IoU-based geometry reward and an FG-CLIP intention-alignment reward, regularized by KL divergence to a frozen reference policy.
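As a rough illustration of the second stage, the dual reward can be sketched as a weighted combination of an IoU geometry term and an intention-alignment term, with GRPO's group-relative advantage normalization. This is a minimal sketch, not the released implementation: the function names and reward weights are illustrative, and the FG-CLIP region-intention similarity is stubbed as a precomputed score in [0, 1].

```python
import numpy as np

def iou(box_a, box_b):
    """IoU between two [x1, y1, x2, y2] boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def dual_reward(pred_box, gt_box, align_score, w_geo=0.5, w_align=0.5):
    """Geometry reward (IoU with ground truth) plus intention-alignment
    reward; `align_score` stands in for the FG-CLIP similarity between
    the predicted region crop and the intention text."""
    return w_geo * iou(pred_box, gt_box) + w_align * align_score

def grpo_advantages(rewards):
    """Group-relative advantages: rewards of rollouts sampled for the
    same (image, intention) pair, normalized within the group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)
```

In GRPO, these normalized advantages weight the policy-gradient update for each rollout, while the KL term to the frozen reference policy (not shown) keeps the fine-tuned model close to its initialization.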
Quantitative comparison on TriIntentBench across the Egocentric, 3rd-Person Indoor, and 3rd-Person Outdoor subsets. We report mIoU and AP (%) at IoU thresholds 0.50 and 0.75, and averaged over 0.50:0.95.
| Method | Ego. mIoU | Ego. AP@50 | Ego. AP@75 | Ego. AP@50:95 | Indoor mIoU | Indoor AP@50 | Indoor AP@75 | Indoor AP@50:95 | Outdoor mIoU | Outdoor AP@50 | Outdoor AP@75 | Outdoor AP@50:95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *Open-vocabulary object detectors / segmenters* | | | | | | | | | | | | |
| YOLO-World | 11.18 | 9.07 | 7.45 | 7.07 | 12.20 | 9.55 | 6.26 | 6.48 | 13.50 | 11.20 | 8.50 | 9.30 |
| GroundingDINO-T | 12.95 | 10.30 | 8.20 | 7.88 | 10.71 | 8.53 | 5.89 | 5.90 | 14.72 | 12.16 | 9.60 | 9.36 |
| GroundingDINO-B | 13.08 | 10.54 | 8.54 | 8.15 | 12.84 | 10.28 | 7.44 | 7.46 | 14.60 | 12.25 | 9.68 | 9.41 |
| SAM3 | 16.38 | 13.25 | 10.52 | 10.28 | 11.93 | 9.34 | 5.93 | 5.97 | 12.20 | 10.80 | 7.90 | 7.60 |
| *Open-source VLMs* | | | | | | | | | | | | |
| Qwen3-VL-2B | 29.98 | 30.74 | 26.34 | 25.01 | 8.67 | 8.33 | 4.07 | 4.60 | 34.12 | 35.11 | 28.84 | 27.36 |
| InternVL3.5-2B | 22.45 | 22.02 | 10.83 | 11.94 | 23.87 | 23.43 | 14.65 | 14.76 | 35.90 | 37.75 | 29.86 | 28.15 |
| Qwen2.5-VL-3B | 15.95 | 13.41 | 8.97 | 9.12 | 21.27 | 18.98 | 8.13 | 9.44 | 40.34 | 42.15 | 34.78 | 33.14 |
| Qwen3-VL-4B | 51.25 | 53.36 | 48.27 | 45.66 | 37.67 | 39.77 | 30.54 | 28.74 | 46.10 | 49.02 | 41.01 | 38.68 |
| InternVL3.5-4B | 35.73 | 36.84 | 28.81 | 27.79 | 28.36 | 26.34 | 18.82 | 19.41 | 42.10 | 44.60 | 35.20 | 33.50 |
| Qwen2.5-VL-7B | 18.37 | 15.92 | 11.67 | 11.57 | 22.48 | 20.92 | 7.75 | 9.85 | 42.97 | 46.34 | 37.06 | 35.19 |
| Qwen3-VL-8B | 51.93 | 54.57 | 48.47 | 46.02 | 38.69 | 41.32 | 31.25 | 29.55 | 48.69 | 52.60 | 43.21 | 40.63 |
| InternVL3.5-8B | 38.58 | 40.90 | 31.51 | 30.02 | 30.60 | 31.31 | 22.85 | 21.78 | 47.57 | 51.10 | 42.55 | 40.20 |
| MiniCPM-V-4.5-8B | 19.34 | 16.87 | 12.59 | 12.48 | 23.67 | 21.44 | 9.13 | 10.28 | 41.72 | 43.56 | 35.29 | 34.15 |
| GLM-4.1V-9B | 52.46 | 55.73 | 48.47 | 45.60 | 41.56 | 44.99 | 33.38 | 31.90 | 47.78 | 51.38 | 39.83 | 38.23 |
| *Closed-source VLM* | | | | | | | | | | | | |
| GPT-5.2 | 37.71 | 39.25 | 30.56 | 29.63 | 30.21 | 30.73 | 21.31 | 20.78 | 45.79 | 49.53 | 40.91 | 39.46 |
| Intent-R1 (Ours) | 64.12 | 68.44 | 60.13 | 56.95 | 56.35 | 60.87 | 46.36 | 45.18 | 65.87 | 74.12 | 57.89 | 54.78 |
Intent-R1 achieves the best score on every reported metric across all three subsets, with clear gains over strong open-vocabulary detector/segmenter and VLM baselines.
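For context, the metrics in the table can be computed per intention query. Since each intention is paired with its target instance, AP at a threshold reduces here to the fraction of predictions whose IoU with the ground truth meets that threshold (a simplification of the full precision-recall AP used for multi-instance detection). A minimal sketch, assuming one predicted box per intention:

```python
def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def evaluate(preds, gts):
    """mIoU plus AP@50, AP@75, and AP@50:95 (mean over 0.50:0.05:0.95)."""
    ious = [iou(p, g) for p, g in zip(preds, gts)]
    n = len(ious)
    ap = lambda t: sum(i >= t for i in ious) / n
    ap_50_95 = sum(ap(0.50 + 0.05 * k) for k in range(10)) / 10
    return {"mIoU": sum(ious) / n, "AP@50": ap(0.50),
            "AP@75": ap(0.75), "AP@50:95": ap_50_95}
```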
Comparison of predictions from strong VLM baselines and Intent-R1 against the ground truth. Green boxes indicate ground-truth targets and red boxes indicate model predictions.
Each component of Intent-R1 contributes to the overall performance.
| Variant | SFT | GRPO | Dual reward | Data-efficient filter | Ego mIoU | Indoor mIoU | Outdoor mIoU | Avg. mIoU |
|---|---|---|---|---|---|---|---|---|
| (i) | ✓ | | | | 57.04 | 48.80 | 60.01 | 55.28 |
| (ii) | | ✓ | ✓ | ✓ | 46.02 | 38.14 | 51.20 | 45.12 |
| (iii) | ✓ | ✓ | | ✓ | 59.97 | 53.25 | 61.70 | 58.31 |
| (iv) | ✓ | ✓ | ✓ | | 61.33 | 54.74 | 63.29 | 59.79 |
| (v) | ✓ | ✓ | ✓ | ✓ | 64.12 | 56.35 | 65.87 | 62.11 |
Coming Soon...