Vision–Language Models (VLMs) have advanced across multimodal benchmarks but still show clear gaps in ordinal number understanding: the ability to track relative positions in a sequence and to generalize to large indices. We present OrdinalBench, a diagnostic benchmark that standardizes ordinal number understanding as an evaluation task for VLMs. The core task is $N$-th object identification, defined by a starting reference object and a traversal rule. Task difficulty is controlled along three axes: (i) ordinal magnitude (from small numbers to extreme cases up to 300), (ii) arrangement complexity (from single loops to maze-like paths), and (iii) object count. The benchmark provides 39,000 question–answer pairs, balanced across difficulty levels and each annotated with a ground-truth reasoning trajectory, supporting controlled yet large-scale testing. Beyond answer-only evaluation, our framework requires models to generate structured, stepwise traces of the counting process and supplies an open evaluation toolkit that measures both final accuracy and step-level path consistency. Zero-shot evaluations of GPT-5, Gemini 2.5 Flash Lite, Qwen2.5-VL, InternVL3.5, and Molmo reveal sharp degradation under large-ordinal and complex-path conditions, highlighting weak generalization despite strong scores on standard multimodal tasks. By framing ordinal number understanding as a core evaluation target, OrdinalBench offers a reproducible testbed and actionable diagnostics for developing VLMs with stronger sequential reasoning.
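To make the core task concrete, the following is a minimal sketch of $N$-th object identification on a single-loop arrangement, together with a simple step-level consistency score. The function names, the modular-traversal rule, and the prefix-matching consistency metric are illustrative assumptions, not the benchmark's actual implementation.

```python
def nth_object(objects, start, n, step=1):
    """Return the N-th object and the visited-index trace.

    objects : labels arranged along a closed loop (illustrative setting)
    start   : index of the starting reference object
    n       : ordinal position to reach (n=1 is the start itself)
    step    : traversal rule (+1 clockwise, -1 counter-clockwise)
    """
    # Modular arithmetic models wraparound on a single loop; the ordinal
    # magnitude n may exceed the number of objects.
    trace = [(start + k * step) % len(objects) for k in range(n)]
    return objects[trace[-1]], trace


def path_consistency(pred_trace, gold_trace):
    """Hypothetical step-level metric: fraction of ground-truth steps
    matched position-by-position by the model's emitted trace."""
    matches = sum(p == g for p, g in zip(pred_trace, gold_trace))
    return matches / len(gold_trace)


# A loop of five objects: the 7th object clockwise from index 0 wraps
# around the loop once before landing on "B".
loop = ["A", "B", "C", "D", "E"]
ans, trace = nth_object(loop, start=0, n=7)
# ans == "B"; trace == [0, 1, 2, 3, 4, 0, 1]
```

A stepwise trace like `trace` above is what an evaluation harness in this style would compare against a model's emitted reasoning steps, so that a model can be credited for a partially correct traversal even when its final answer is wrong.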