AgentEscapeBench, published May 8, 2026, presents 270 escape-room-style tasks across five difficulty tiers. In each task, an agent must invoke real external tools, track hidden state that is revealed incrementally, and propagate intermediate results through a directed dependency graph, without relying on prior domain knowledge or fixed workflow templates. Across the 16 LLM agents tested, the best model's success rate fell from 90.0% at tier-5 to 60.0% at tier-25, a drop of 30.0 percentage points; human success fell from 98.3% to 80.0%, a drop of 18.3 points. The primary failure modes are state tracking across extended sequences and propagation of intermediate results.
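To make "propagating intermediate results through a directed dependency graph" concrete, here is a minimal Python sketch. It is an illustration only and is not taken from the benchmark: the tool names (`read_note`, `open_drawer`, `unlock_door`), their return values, and the `run_chain` helper are all hypothetical. Each tool runs only after the tools it depends on, and receives their outputs as input.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List

# Hypothetical tool type: takes the results of its dependencies, returns a value.
Tool = Callable[[Dict[str, int]], int]


def run_chain(tools: Dict[str, Tool], deps: Dict[str, List[str]]) -> Dict[str, int]:
    """Run every tool after its dependencies, passing their results forward."""
    results: Dict[str, int] = {}
    # static_order() yields each node only after all of its predecessors.
    for name in TopologicalSorter(deps).static_order():
        inputs = {d: results[d] for d in deps.get(name, [])}
        results[name] = tools[name](inputs)
    return results


# Illustrative three-step puzzle (assumed, not from the benchmark).
tools: Dict[str, Tool] = {
    "read_note": lambda _: 7,                                   # reveals a number
    "open_drawer": lambda r: r["read_note"] * 3,                # needs the note
    "unlock_door": lambda r: r["open_drawer"] + r["read_note"], # needs both
}
deps = {
    "read_note": [],
    "open_drawer": ["read_note"],
    "unlock_door": ["open_drawer", "read_note"],
}

state = run_chain(tools, deps)
print(state["unlock_door"])
```

In this toy setup, a wrong value from `read_note` would corrupt both later steps, which gives a rough sense of why deeper dependency chains leave more room for state-tracking and propagation errors.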

AgentEscapeBench (arXiv:2605.07926): 270-Instance Escape-Room Benchmark Finds Best LLM Agent Drops from 90% to 60% as Tool Dependency Depth Increases
