arxiv:2606.28322

PerceptionRubrics: Calibrating Multimodal Evaluation to Human Perception

Published on Jun 26 · Submitted by Yana Wei on Jul 2
#1 Paper of the day

Abstract

PerceptionRubrics presents a rubric-based evaluation framework that identifies gaps between benchmark scores and real-world performance through atomic auditing and gated scoring mechanisms.

We introduce PerceptionRubrics, a rubric-based evaluation framework that addresses the gap between saturated benchmark scores and real-world brittleness. Shifting evaluation from holistic semantic matching to rigorous atomic auditing, PerceptionRubrics pairs 1,038 information-dense images with over 12,000 instance-specific rubrics. These criteria are derived from golden captions constructed via a novel Circular Peer-Review consensus pipeline and then distilled into a dual-stream system of Must-Right (essential facts) and Easy-Wrong (fine-grained details) rubrics. Crucially, PerceptionRubrics implements a Gated Scoring mechanism: unlike linear averages, failure on mandatory visual facts triggers sharp binary penalties. Extensive evaluation yields critical insights: (1) The Reliability Gap: models often verify fragmented elements correctly yet fail strict conjunctive constraints, exposing brittleness in dense domains; (2) Open-Closed Stratification: contrary to reasoning trends, we reveal a persistent 8% perception deficit between open-source and proprietary frontiers; and (3) Human-Aligned Rigor: our gated metrics substantially out-align conventional benchmarks, validating that strict perceptual fidelity is the prerequisite for reliable generation.

Community

Paper submitter

🚀 Multimodal large language models show increasingly saturated scores on perception benchmarks, which makes it harder to tell models apart by ranking. In real-world use, however, they still make "unacceptable" visual mistakes: miscounting objects, misinterpreting spatial relations, missing key numbers in charts, or misrecognizing buttons and text in user interfaces.

These errors may be "diluted" by conventional average scores, but from a human perspective, a single critical factual error can make the entire response unreliable.

👉 We introduce PerceptionRubrics, a rubric-based evaluation framework for multimodal perception. It automatically decomposes complex image understanding into verifiable atomic visual facts and defines two types of evaluation criteria, scored with a corresponding gated metric:

Must-Right: core facts that the model must perceive correctly;
Easy-Wrong: fine-grained details that models are prone to omit, hallucinate, or misinterpret.
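
To make the contrast between a plain average and a gated score concrete, here is a minimal Python sketch. It is not the paper's scoring formula: it assumes the simplest possible gate, where any failed Must-Right rubric sets the score for that image to zero, and the `Rubric` class, the function names, and the example rubrics are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rubric:
    text: str     # the atomic visual fact being checked
    kind: str     # "must_right" or "easy_wrong"
    passed: bool  # judge verdict for one model response


def linear_score(rubrics):
    """Plain average over all rubrics (the conventional, ungated score)."""
    if not rubrics:
        raise ValueError("at least one rubric is required")
    return sum(r.passed for r in rubrics) / len(rubrics)


def gated_score(rubrics):
    """Assumed gate: any failed Must-Right rubric zeroes the whole score."""
    if any(r.kind == "must_right" and not r.passed for r in rubrics):
        return 0.0
    return linear_score(rubrics)


# Hypothetical rubrics for a single chart image.
example = [
    Rubric("The chart is a bar chart", "must_right", True),
    Rubric("The tallest bar is labeled 2023", "must_right", False),
    Rubric("The y-axis unit is millions", "easy_wrong", True),
    Rubric("The legend has three entries", "easy_wrong", True),
]

print(linear_score(example))  # 0.75
print(gated_score(example))   # 0.0
```

Under this assumption, a response that gets three of four facts right still scores zero because one of the misses is a mandatory fact, which is the kind of error a linear average would dilute.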

👉 Our benchmark contains 1,038 information-dense images and over 10,000 instance-specific rubrics, covering seven domains: natural scenes, OCR documents, GUIs, charts, STEM, logic puzzles, and creative/cultural images. We evaluate 20+ mainstream MLLMs, including GPT-5.5.

👉 Our results show that models can often recognize fragmented details correctly, yet fail to consistently satisfy multiple critical visual constraints. In particular, perceptual reliability remains a major bottleneck in information-dense scenarios such as GUIs, documents, and structured charts.
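
The gap between recognizing fragments and satisfying every constraint at once can be illustrated with a small Python sketch. The verdicts below are made up for illustration, and the two function names are hypothetical; the sketch only shows how pooled per-rubric accuracy and a strict all-rubrics-pass rate can diverge on the same data.

```python
def atomic_accuracy(results):
    """Fraction of individual rubric checks passed, pooled over all images."""
    checks = [ok for image in results for ok in image]
    return sum(checks) / len(checks)


def conjunctive_pass_rate(results):
    """Fraction of images on which every rubric check passed."""
    return sum(all(image) for image in results) / len(results)


# Hypothetical verdicts: one inner list per image, one bool per critical rubric.
verdicts = [
    [True, True, True, True],
    [True, True, True, False],
    [True, False, True, True],
    [True, True, True, True],
]

print(atomic_accuracy(verdicts))        # 0.875
print(conjunctive_pass_rate(verdicts))  # 0.5
```

Here 14 of 16 individual checks pass, yet only 2 of 4 images satisfy all of their constraints together, so the strict measure is far lower than the pooled one.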

PerceptionRubrics provides a stricter, more diagnostic evaluation tool that better aligns with human perception, helping the community better understand and improve the visual reliability of multimodal models.


