Ebouky et al. (May 8, 2026) present GazeVLM, a 4-billion-parameter vision-language model (VLM) that generates special gaze tokens to dynamically control its causal attention mask. This lets the model shift between focused inspection of a region and global scene awareness without cropping the image or expanding the context window. Trained with Group Relative Policy Optimization (GRPO) using spatial grounding rewards, GazeVLM outperforms state-of-the-art VLMs in its parameter class by approximately 4% and agentic multimodal pipelines by more than 5% on the high-resolution benchmarks HRBench-4k and HRBench-8k.
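To make the idea of a region-restricted causal mask concrete, the following is a minimal sketch, not the authors' implementation. It assumes a sequence of image patches in row-major order followed by text tokens, and a rectangular gaze region given in patch coordinates. The function `build_gaze_mask`, its parameters (`region`, `focus_from`), and the sequence layout are all hypothetical; the summary above does not say how GazeVLM encodes a gaze region or which tokens it affects.

```python
from typing import List, Optional, Tuple

# Hypothetical region format: (row_start, row_end, col_start, col_end),
# half-open ranges measured in image patches.
Box = Tuple[int, int, int, int]


def build_gaze_mask(grid_h: int, grid_w: int, n_text: int,
                    region: Optional[Box] = None,
                    focus_from: int = 0) -> List[List[bool]]:
    """Return mask[q][k] == True where query q may attend to key k.

    Assumed layout: grid_h * grid_w image patches (row-major),
    followed by n_text text tokens.
    """
    n_img = grid_h * grid_w
    n = n_img + n_text
    # Standard causal mask: each token sees itself and all earlier tokens.
    mask = [[k <= q for k in range(n)] for q in range(n)]
    if region is None:
        return mask  # global view: nothing is hidden beyond causality

    r0, r1, c0, c1 = region
    # Text tokens from index `focus_from` onward lose access to patches
    # outside the region; earlier text tokens stay visible to them.
    for t in range(focus_from, n_text):
        q = n_img + t
        for k in range(n_img):
            row, col = divmod(k, grid_w)
            if not (r0 <= row < r1 and c0 <= col < c1):
                mask[q][k] = False
    return mask


# Usage: a 4x4 patch grid, 3 text tokens, focus on the top-left 2x2 patches
# starting at the second text token (sequence position 17).
mask = build_gaze_mask(4, 4, n_text=3, region=(0, 2, 0, 2), focus_from=1)
print([k for k in range(16) if mask[17][k]])  # [0, 1, 4, 5]
```

In this sketch, switching back to the global view only requires building the mask with `region=None`. The image tokens stay in the sequence the whole time, which is one way to read the claim that no cropping or context-window expansion is needed.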

GazeVLM (arXiv:2605.07817): 4B VLM Adds Trainable Gaze Tokens for Dynamic Causal Attention Control; Beats Comparable VLMs by ~4% and Agentic Pipelines by >5% on HRBench
