MazeEval: a benchmark for testing sequential decision-making in language models
Einarsson, H. · Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026) · 2026
MazeEval is a benchmark for testing sequential decision-making in language models through navigation tasks in procedurally generated mazes. The setup probes whether models can maintain a coherent internal map across multiple turns of interaction rather than relying on local pattern matching. Single-author paper published at LREC 2026.
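To make the setup concrete, the following is a minimal illustrative sketch of a multi-turn maze navigation loop. It is not the benchmark's actual code: the maze generator (randomized depth-first search), the feedback format, and every name in it (`generate_maze`, `step`, `shortest_path`, `run_episode`, `policy`) are assumptions introduced here for illustration. The `policy` callable stands in for a language model that must choose each move from the interaction history alone.

```python
import random
from collections import deque

# Hypothetical move vocabulary; the benchmark's real action format may differ.
MOVES = {"N": (0, -1), "S": (0, 1), "E": (1, 0), "W": (-1, 0)}
OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E"}


def generate_maze(width, height, seed):
    """Carve a perfect maze with randomized depth-first search (an assumed generator).

    Returns a dict mapping each cell (x, y) to the set of directions that are open.
    """
    rng = random.Random(seed)
    open_dirs = {(x, y): set() for x in range(width) for y in range(height)}
    stack, seen = [(0, 0)], {(0, 0)}
    while stack:
        x, y = stack[-1]
        options = [
            (d, (x + dx, y + dy))
            for d, (dx, dy) in MOVES.items()
            if (x + dx, y + dy) in open_dirs and (x + dx, y + dy) not in seen
        ]
        if not options:
            stack.pop()  # dead end: backtrack
            continue
        d, nxt = rng.choice(options)
        open_dirs[(x, y)].add(d)  # knock down the wall in both directions
        open_dirs[nxt].add(OPPOSITE[d])
        seen.add(nxt)
        stack.append(nxt)
    return open_dirs


def step(maze, pos, move):
    """Apply one move; return (new_pos, moved). Walking into a wall leaves pos unchanged."""
    if move in maze[pos]:
        dx, dy = MOVES[move]
        return (pos[0] + dx, pos[1] + dy), True
    return pos, False


def shortest_path(maze, start, goal):
    """Breadth-first reference solution, returned as a list of moves."""
    queue, parent = deque([start]), {start: None}
    while queue:
        cur = queue.popleft()
        if cur == goal:
            break
        for d in sorted(maze[cur]):
            nxt, _ = step(maze, cur, d)
            if nxt not in parent:
                parent[nxt] = (cur, d)
                queue.append(nxt)
    moves, cur = [], goal
    while parent[cur] is not None:
        cur, d = parent[cur]
        moves.append(d)
    return moves[::-1]


def run_episode(maze, start, goal, policy, max_turns):
    """Multi-turn loop: each turn the policy sees only its position, the goal,
    and the history of (move, moved, position) feedback, never the full maze."""
    pos, history = start, []
    for _ in range(max_turns):
        if pos == goal:
            return True, len(history)
        move = policy(pos, goal, history)
        pos, moved = step(maze, pos, move)
        history.append((move, moved, pos))
    return pos == goal, len(history)
```

In this sketch, an agent that only reacts to the latest feedback tends to revisit dead ends, whereas one that accumulates a map from `history` can avoid them, which is the distinction the benchmark description draws between coherent internal maps and local pattern matching. Comparing the number of turns used against `len(shortest_path(...))` is one possible efficiency measure, again an assumption rather than the paper's stated metric.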