Qian et al. (May 6, 2026) release CreativityBench, built on a knowledge base of 4,000 entities with 150,000+ affordance annotations linking objects, parts, physical attributes, and non-canonical uses. The 14,000-task benchmark requires models to identify the correct object parts and physical mechanisms to solve constrained creative problems, rather than rely on canonical usage patterns. Evaluating 10 frontier models, the authors find that performance gains from scaling plateau quickly, that general reasoning ability does not reliably transfer to affordance-based creativity, and that chain-of-thought prompting yields limited improvement.

CreativityBench (arXiv:2605.02910): 14K Affordance-Based Creative Reasoning Tasks Show LLM Scaling Gains Saturate Quickly; Chain-of-Thought Yields Limited Improvement
