Ref-Adv

Exploring MLLM Visual Reasoning in Referring Expression Tasks

Northeastern University
ICLR 2026
Limitations of classic REC benchmarks

Standard REC benchmarks let models take shortcuts via the issues outlined in the Introduction below. Ref-Adv addresses these by pairing complex referring expressions with hard visual distractors.

Introduction

Referring Expression Comprehension (REC) links natural language to region-level visual perception—given an image and a text expression, the task is to localize the described object. Standard benchmarks such as RefCOCO, RefCOCO+, and RefCOCOg have driven years of progress, yet they harbor critical shortcuts:

  1. Expressions are too short (avg. ~3 words), demanding little reasoning.
  2. Few visual distractors make the target easy to find by elimination.
  3. Redundant descriptors let models latch onto a single cue and ignore the rest.

Ref-Adv is a modern REC benchmark designed to suppress these shortcuts. Every referring expression is paired with only the information necessary to uniquely identify the target among hard visual distractors. The dataset features an average expression length of 11.5 words, 4.01 distractors per image (each case contains at least 2 distractors), and a 21.25% negation ratio—substantially surpassing existing benchmarks in both linguistic and visual complexity.
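As an illustration, the dataset-level statistics quoted above (mean expression length, mean distractor count, negation ratio) can be computed with a few lines over the annotations. This is a hedged sketch: the field names (`expression`, `n_distractors`) and the negation-cue list are hypothetical, not Ref-Adv's actual schema.

```python
# Sketch of how the quoted dataset statistics could be computed.
# Field names and the negation-cue list are assumptions for illustration.
NEGATION_CUES = ("not ", "no ", "without ", "n't ")

def dataset_stats(cases):
    """cases: list of dicts with 'expression' (str) and 'n_distractors' (int)."""
    n = len(cases)
    avg_len = sum(len(c["expression"].split()) for c in cases) / n
    avg_distractors = sum(c["n_distractors"] for c in cases) / n
    # A case counts as negated if any cue appears as a word prefix.
    negation_ratio = sum(
        any(cue in " " + c["expression"].lower() + " " for cue in NEGATION_CUES)
        for c in cases
    ) / n
    return avg_len, avg_distractors, negation_ratio

cases = [
    {"expression": "the mug that is not red", "n_distractors": 3},
    {"expression": "the tall lamp behind the sofa", "n_distractors": 5},
]
print(dataset_stats(cases))  # (6.0, 4.0, 0.5)
```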

We open-source Ref-Adv-s, a subset of 1,142 cases, together with evaluation code and reproducible results on the Qwen2.5-VL, Qwen3-VL, and Qwen3.5 series models.

Results on Ref-Adv-s

Ref-Adv-s is the publicly released subset of 1,142 cases with evaluation code. We report accuracy at three IoU thresholds (Acc@0.5, @0.75, @0.9) and accuracy per distractor-count group, where each Δ is the group's Acc@0.5 minus the model's overall Acc@0.5.
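The Acc@IoU metric above can be sketched in a few lines: a prediction counts as correct when the intersection-over-union between the predicted and ground-truth boxes meets the threshold. The `(x1, y1, x2, y2)` box format is an assumption for illustration; Ref-Adv-s's actual annotation schema may differ.

```python
# Minimal sketch of Acc@IoU, assuming (x1, y1, x2, y2) box coordinates.
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1)
             - inter)
    return inter / union if union > 0 else 0.0

def acc_at(preds, gts, thresh):
    """Fraction of cases whose predicted box reaches IoU >= thresh (Acc@thresh)."""
    hits = sum(iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(gts)

preds = [(10, 10, 50, 50), (0, 0, 20, 20)]
gts   = [(12, 12, 52, 52), (30, 30, 60, 60)]
print(acc_at(preds, gts, 0.5))  # 0.5: the first box overlaps enough, the second misses
```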

| Model | CoT | Acc@0.5 | Acc@0.75 | Acc@0.9 | 2-3 | Δ | 4-6 | Δ | ≥7 | Δ |
|---|---|---|---|---|---|---|---|---|---|---|
| Human Expert (High)* | – | 90.3 | – | – | – | – | – | – | – | – |
| Human Expert (Medium)* | – | 80.6 | – | – | – | – | – | – | – | – |
| Qwen2.5-VL | | | | | | | | | | |
| 3B-Instruct | ✗ | 23.8 | 18.1 | 8.8 | 25.9 | +2.1 | 21.9 | -1.9 | 17.1 | -6.8 |
| 3B-Instruct | ✓ | 25.3 | 19.1 | 9.5 | 28.2 | +2.9 | 22.9 | -2.4 | 15.5 | -9.8 |
| 7B-Instruct | ✗ | 39.3 | 29.2 | 12.5 | 42.8 | +3.5 | 36.8 | -2.5 | 26.4 | -13.0 |
| 7B-Instruct | ✓ | 39.0 | 28.8 | 11.6 | 43.0 | +4.0 | 35.2 | -3.7 | 26.4 | -12.6 |
| 32B-Instruct | ✗ | 48.0 | 35.5 | 16.0 | 51.6 | +3.6 | 43.8 | -4.2 | 38.8 | -9.2 |
| 32B-Instruct | ✓ | 50.6 | 37.7 | 16.0 | 55.2 | +4.5 | 44.8 | -5.9 | 40.3 | -10.3 |
| 72B-Instruct | ✗ | 54.0 | 40.1 | 18.0 | 57.0 | +3.0 | 52.7 | -1.3 | 41.1 | -12.9 |
| 72B-Instruct | ✓ | 52.4 | 39.0 | 18.3 | 56.9 | +4.5 | 47.9 | -4.4 | 38.8 | -13.6 |
| Qwen3-VL | | | | | | | | | | |
| 2B-Instruct | ✗ | 23.5 | 19.2 | 11.0 | 26.1 | +2.6 | 20.0 | -3.5 | 17.8 | -5.6 |
| 2B-Instruct | ✓ | 25.2 | 20.6 | 11.4 | 28.1 | +2.9 | 21.3 | -3.9 | 19.4 | -5.8 |
| 2B-Thinking | – | 44.4 | 36.8 | 21.8 | 48.6 | +4.2 | 40.6 | -3.8 | 31.0 | -13.4 |
| 4B-Instruct | ✗ | 41.9 | 34.9 | 20.7 | 46.4 | +4.5 | 36.2 | -5.8 | 31.8 | -10.2 |
| 4B-Instruct | ✓ | 42.5 | 34.9 | 20.6 | 46.6 | +4.1 | 36.5 | -6.0 | 34.9 | -7.6 |
| 4B-Thinking | – | 57.6 | 45.5 | 27.8 | 63.0 | +5.4 | 52.7 | -4.9 | 40.3 | -17.3 |
| 8B-Instruct | ✗ | 47.2 | 37.0 | 19.1 | 51.3 | +4.1 | 44.1 | -3.1 | 32.6 | -14.6 |
| 8B-Instruct | ✓ | 52.3 | 38.9 | 19.9 | 55.7 | +3.5 | 50.2 | -2.1 | 38.8 | -13.5 |
| 8B-Thinking | – | 59.5 | 48.2 | 27.3 | 63.5 | +4.0 | 55.6 | -3.9 | 47.3 | -12.2 |
| 30B-A3B-Instruct | ✗ | 44.0 | 37.6 | 23.4 | 47.6 | +3.5 | 40.3 | -3.7 | 34.1 | -9.9 |
| 30B-A3B-Instruct | ✓ | 52.1 | 43.1 | 27.4 | 54.7 | +2.6 | 48.9 | -3.2 | 45.7 | -6.4 |
| 30B-A3B-Thinking | – | 64.1 | 52.6 | 31.6 | 67.3 | +3.2 | 62.2 | -1.9 | 51.2 | -12.9 |
| 32B-Instruct | ✗ | 53.4 | 44.7 | 27.1 | 56.3 | +2.9 | 50.2 | -3.3 | 45.7 | -7.7 |
| 32B-Instruct | ✓ | 59.0 | 47.5 | 27.6 | 60.9 | +1.9 | 57.5 | -1.6 | 52.7 | -6.3 |
| 32B-Thinking | – | 65.6 | 52.8 | 31.6 | 67.9 | +2.3 | 65.7 | +0.1 | 52.7 | -12.9 |
| 235B-A22B-Instruct | ✗ | 57.3 | 47.5 | 30.0 | 63.3 | +6.1 | 51.7 | -5.5 | 38.0 | -19.3 |
| 235B-A22B-Instruct | ✓ | 59.3 | 48.9 | 29.9 | 63.5 | +4.2 | 54.9 | -4.4 | 47.3 | -12.0 |
| 235B-A22B-Thinking | – | 67.1 | 53.6 | 31.8 | 69.6 | +2.6 | 65.7 | -1.4 | 56.6 | -10.5 |
| Qwen3.5 | | | | | | | | | | |
| 27B | – | 67.3 | 54.9 | 32.7 | 69.9 | +2.7 | 65.7 | -1.5 | 56.6 | -10.7 |
| 35B-A3B | – | 66.7 | 54.4 | 34.9 | 68.9 | +2.2 | 65.4 | -1.3 | 58.1 | -8.6 |
| 122B-A10B | – | 67.2 | 55.0 | 35.1 | 69.9 | +2.8 | 66.3 | -0.8 | 54.3 | -12.9 |
| 397B-A17B-FP8 | – | 68.0 | 55.6 | 34.2 | 70.1 | +2.1 | 67.9 | -0.0 | 56.6 | -11.4 |
* Human expert results are evaluated on a randomly selected subset of the benchmark.
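The per-distractor-group columns above can be reproduced by bucketing cases by distractor count and comparing each bucket's Acc@0.5 to the overall Acc@0.5. A minimal sketch, assuming per-case records with hypothetical `n_distractors` and `correct` fields (where `correct` means the prediction passed the IoU 0.5 check):

```python
# Bucket cases by distractor count (2-3, 4-6, >=7) and compute each bucket's
# accuracy and its delta vs. overall accuracy, matching the table's Δ columns.
# Field names are assumptions for illustration, not Ref-Adv-s's actual schema.
def group_of(n_distractors):
    if n_distractors <= 3:
        return "2-3"
    if n_distractors <= 6:
        return "4-6"
    return ">=7"

def breakdown(cases):
    """cases: list of dicts with 'n_distractors' (int) and 'correct' (bool)."""
    overall = sum(c["correct"] for c in cases) / len(cases)
    per_group = {}
    for name in ("2-3", "4-6", ">=7"):
        bucket = [c for c in cases if group_of(c["n_distractors"]) == name]
        if bucket:
            acc = sum(c["correct"] for c in bucket) / len(bucket)
            per_group[name] = (acc, acc - overall)  # (group Acc, Δ vs. overall)
    return overall, per_group

cases = [
    {"n_distractors": 2, "correct": True},
    {"n_distractors": 3, "correct": True},
    {"n_distractors": 5, "correct": False},
    {"n_distractors": 8, "correct": True},
]
print(breakdown(cases))  # overall 0.75; "2-3" -> (1.0, +0.25), "4-6" -> (0.0, -0.75)
```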

BibTeX

@inproceedings{dong2026refadv,
    title={Ref-Adv: Exploring {MLLM} Visual Reasoning in Referring Expression Tasks},
    author={Qihua Dong and Kuo Yang and Lin Ju and Handong Zhao and Yitian Zhang and Yizhou Wang and Huimin Zeng and Jianglin Lu and Yun Fu},
    booktitle={The Fourteenth International Conference on Learning Representations},
    year={2026},
    url={https://openreview.net/forum?id=iEBgrepR9i}
}