Ref-Adv

Exploring MLLM Visual Reasoning in Referring Expression Tasks

Northeastern University
ICLR 2026
Limitations of classic REC benchmarks

Standard REC benchmarks let models take shortcuts via the issues outlined in the Introduction below. Ref-Adv addresses these by pairing complex referring expressions with hard visual distractors.

Introduction

Referring Expression Comprehension (REC) links natural language to region-level visual perception—given an image and a text expression, the task is to localize the described object. Standard benchmarks such as RefCOCO, RefCOCO+, and RefCOCOg have driven years of progress, yet they harbor critical shortcuts:

  1. Expressions are too short (avg. ~3 words), demanding little reasoning.
  2. Few visual distractors make the target easy to find by elimination.
  3. Redundant descriptors let models latch onto a single cue and ignore the rest.

Ref-Adv is a modern REC benchmark designed to suppress these shortcuts. Every referring expression is paired with only the information necessary to uniquely identify the target among hard visual distractors. The dataset features an average expression length of 11.5 words, 4.01 distractors per image (each case contains at least 2 distractors), and a 21.25% negation ratio—substantially surpassing existing benchmarks in both linguistic and visual complexity.
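As an illustration, the dataset-level statistics quoted above (mean expression length, mean distractor count, negation ratio) can be computed with a few lines over the annotations. This is a hedged sketch: the field names (`expression`, `n_distractors`) and the negation-cue list are hypothetical, not Ref-Adv's actual schema.

```python
# Sketch of how the quoted dataset statistics could be computed.
# Field names and the negation-cue list are assumptions for illustration.
NEGATION_CUES = ("not ", "no ", "without ", "n't ")

def dataset_stats(cases):
    """cases: list of dicts with 'expression' (str) and 'n_distractors' (int)."""
    n = len(cases)
    avg_len = sum(len(c["expression"].split()) for c in cases) / n
    avg_distractors = sum(c["n_distractors"] for c in cases) / n
    # A case counts as negated if any cue appears as a word prefix.
    negation_ratio = sum(
        any(cue in " " + c["expression"].lower() + " " for cue in NEGATION_CUES)
        for c in cases
    ) / n
    return avg_len, avg_distractors, negation_ratio

cases = [
    {"expression": "the mug that is not red", "n_distractors": 3},
    {"expression": "the tall lamp behind the sofa", "n_distractors": 5},
]
print(dataset_stats(cases))  # (6.0, 4.0, 0.5)
```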

We open-source Ref-Adv-s, a subset of 1,142 cases, together with evaluation code and reproducible results on the Qwen2.5-VL, Qwen3-VL, and Qwen3.5 series models.

Results on Ref-Adv-s

Ref-Adv-s is the publicly released subset of 1,142 cases with evaluation code. We report accuracy at three IoU thresholds (Acc@0.5, @0.75, @0.9) and accuracy per distractor-count group, where each Δ is the group's Acc@0.5 minus the model's overall Acc@0.5.
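The Acc@IoU metric above can be sketched in a few lines: a prediction counts as correct when the intersection-over-union between the predicted and ground-truth boxes meets the threshold. The `(x1, y1, x2, y2)` box format is an assumption for illustration; Ref-Adv-s's actual annotation schema may differ.

```python
# Minimal sketch of Acc@IoU, assuming (x1, y1, x2, y2) box coordinates.
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1)
             - inter)
    return inter / union if union > 0 else 0.0

def acc_at(preds, gts, thresh):
    """Fraction of cases whose predicted box reaches IoU >= thresh (Acc@thresh)."""
    hits = sum(iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(gts)

preds = [(10, 10, 50, 50), (0, 0, 20, 20)]
gts   = [(12, 12, 52, 52), (30, 30, 60, 60)]
print(acc_at(preds, gts, 0.5))  # 0.5: the first box overlaps enough, the second misses
```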

| Model | CoT | Acc@0.5 | Acc@0.75 | Acc@0.9 | 2-3 | Δ | 4-6 | Δ | ≥7 | Δ |
|---|---|---|---|---|---|---|---|---|---|---|
| Human Expert (High)* | – | 90.3 | – | – | – | – | – | – | – | – |
| Human Expert (Medium)* | – | 80.6 | – | – | – | – | – | – | – | – |
| Qwen2.5-VL | | | | | | | | | | |
| 3B-Instruct | ✗ | 23.8 | 18.1 | 8.8 | 25.9 | +2.1 | 21.9 | -1.9 | 17.1 | -6.8 |
| 3B-Instruct | ✓ | 25.3 | 19.1 | 9.5 | 28.2 | +2.9 | 22.9 | -2.4 | 15.5 | -9.8 |
| 7B-Instruct | ✗ | 39.3 | 29.2 | 12.5 | 42.8 | +3.5 | 36.8 | -2.5 | 26.4 | -13.0 |
| 7B-Instruct | ✓ | 39.0 | 28.8 | 11.6 | 43.0 | +4.0 | 35.2 | -3.7 | 26.4 | -12.6 |
| 32B-Instruct | ✗ | 48.0 | 35.5 | 16.0 | 51.6 | +3.6 | 43.8 | -4.2 | 38.8 | -9.2 |
| 32B-Instruct | ✓ | 50.6 | 37.7 | 16.0 | 55.2 | +4.5 | 44.8 | -5.9 | 40.3 | -10.3 |
| 72B-Instruct | ✗ | 54.0 | 40.1 | 18.0 | 57.0 | +3.0 | 52.7 | -1.3 | 41.1 | -12.9 |
| 72B-Instruct | ✓ | 52.4 | 39.0 | 18.3 | 56.9 | +4.5 | 47.9 | -4.4 | 38.8 | -13.6 |
| Qwen3-VL | | | | | | | | | | |
| 2B-Instruct | ✗ | 23.5 | 19.2 | 11.0 | 26.1 | +2.6 | 20.0 | -3.5 | 17.8 | -5.6 |
| 2B-Instruct | ✓ | 25.2 | 20.6 | 11.4 | 28.1 | +2.9 | 21.3 | -3.9 | 19.4 | -5.8 |
| 2B-Thinking | – | 44.4 | 36.8 | 21.8 | 48.6 | +4.2 | 40.6 | -3.8 | 31.0 | -13.4 |
| 4B-Instruct | ✗ | 41.9 | 34.9 | 20.7 | 46.4 | +4.5 | 36.2 | -5.8 | 31.8 | -10.2 |
| 4B-Instruct | ✓ | 42.5 | 34.9 | 20.6 | 46.6 | +4.1 | 36.5 | -6.0 | 34.9 | -7.6 |
| 4B-Thinking | – | 57.6 | 45.5 | 27.8 | 63.0 | +5.4 | 52.7 | -4.9 | 40.3 | -17.3 |
| 8B-Instruct | ✗ | 47.2 | 37.0 | 19.1 | 51.3 | +4.1 | 44.1 | -3.1 | 32.6 | -14.6 |
| 8B-Instruct | ✓ | 52.3 | 38.9 | 19.9 | 55.7 | +3.5 | 50.2 | -2.1 | 38.8 | -13.5 |
| 8B-Thinking | – | 59.5 | 48.2 | 27.3 | 63.5 | +4.0 | 55.6 | -3.9 | 47.3 | -12.2 |
| 30B-A3B-Instruct | ✗ | 44.0 | 37.6 | 23.4 | 47.6 | +3.5 | 40.3 | -3.7 | 34.1 | -9.9 |
| 30B-A3B-Instruct | ✓ | 52.1 | 43.1 | 27.4 | 54.7 | +2.6 | 48.9 | -3.2 | 45.7 | -6.4 |
| 30B-A3B-Thinking | – | 64.1 | 52.6 | 31.6 | 67.3 | +3.2 | 62.2 | -1.9 | 51.2 | -12.9 |
| 32B-Instruct | ✗ | 53.4 | 44.7 | 27.1 | 56.3 | +2.9 | 50.2 | -3.3 | 45.7 | -7.7 |
| 32B-Instruct | ✓ | 59.0 | 47.5 | 27.6 | 60.9 | +1.9 | 57.5 | -1.6 | 52.7 | -6.3 |
| 32B-Thinking | – | 65.6 | 52.8 | 31.6 | 67.9 | +2.3 | 65.7 | +0.1 | 52.7 | -12.9 |
| 235B-A22B-Instruct | ✗ | 57.3 | 47.5 | 30.0 | 63.3 | +6.1 | 51.7 | -5.5 | 38.0 | -19.3 |
| 235B-A22B-Instruct | ✓ | 59.3 | 48.9 | 29.9 | 63.5 | +4.2 | 54.9 | -4.4 | 47.3 | -12.0 |
| 235B-A22B-Thinking | – | 67.1 | 53.6 | 31.8 | 69.6 | +2.6 | 65.7 | -1.4 | 56.6 | -10.5 |
| Qwen3.5 | | | | | | | | | | |
| 27B | – | 67.3 | 54.9 | 32.7 | 69.9 | +2.7 | 65.7 | -1.5 | 56.6 | -10.7 |
| 35B-A3B | – | 66.7 | 54.4 | 34.9 | 68.9 | +2.2 | 65.4 | -1.3 | 58.1 | -8.6 |
| 122B-A10B | – | 67.2 | 55.0 | 35.1 | 69.9 | +2.8 | 66.3 | -0.8 | 54.3 | -12.9 |
| 397B-A17B-FP8 | – | 68.0 | 55.6 | 34.2 | 70.1 | +2.1 | 67.9 | -0.0 | 56.6 | -11.4 |
* Human expert results are evaluated on a randomly selected subset of the benchmark.
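The per-distractor-group columns above can be reproduced by bucketing cases by distractor count and comparing each bucket's Acc@0.5 to the overall Acc@0.5. A minimal sketch, assuming per-case records with hypothetical `n_distractors` and `correct` fields (where `correct` means the prediction passed the IoU 0.5 check):

```python
# Bucket cases by distractor count (2-3, 4-6, >=7) and compute each bucket's
# accuracy and its delta vs. overall accuracy, matching the table's Δ columns.
# Field names are assumptions for illustration, not Ref-Adv-s's actual schema.
def group_of(n_distractors):
    if n_distractors <= 3:
        return "2-3"
    if n_distractors <= 6:
        return "4-6"
    return ">=7"

def breakdown(cases):
    """cases: list of dicts with 'n_distractors' (int) and 'correct' (bool)."""
    overall = sum(c["correct"] for c in cases) / len(cases)
    per_group = {}
    for name in ("2-3", "4-6", ">=7"):
        bucket = [c for c in cases if group_of(c["n_distractors"]) == name]
        if bucket:
            acc = sum(c["correct"] for c in bucket) / len(bucket)
            per_group[name] = (acc, acc - overall)  # (group Acc, Δ vs. overall)
    return overall, per_group

cases = [
    {"n_distractors": 2, "correct": True},
    {"n_distractors": 3, "correct": True},
    {"n_distractors": 5, "correct": False},
    {"n_distractors": 8, "correct": True},
]
print(breakdown(cases))  # overall 0.75; "2-3" -> (1.0, +0.25), "4-6" -> (0.0, -0.75)
```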

BibTeX

@inproceedings{dong2026refadv,
    title={Ref-Adv: Exploring {MLLM} Visual Reasoning in Referring Expression Tasks},
    author={Qihua Dong and Kuo Yang and Lin Ju and Handong Zhao and Yitian Zhang and Yizhou Wang and Huimin Zeng and Jianglin Lu and Yun Fu},
    booktitle={The Fourteenth International Conference on Learning Representations},
    year={2026},
    url={https://openreview.net/forum?id=iEBgrepR9i}
}