Leaderboard
Rankings, release-date trends, efficiency, and the comparison plots that show how models trade off score, validity, and difficulty.
Each model is tested 3 times per starting word. Scores are averaged across all trials.
| # | Model | Company | Path Length | Difficulty-Weighted |
|---|-------|---------|-------------|---------------------|
Timeline
How the frontier moves over model release dates.
Reasoning Efficiency
Output tokens spent vs score achieved. Models below the frontier are spending tokens without proportional gains.
Path Length vs Validity
Validity is the fraction of legal moves in a chain. Separates models whose chains are long and clean from those that trade correctness for length.
Path Length vs Difficulty
Average graph-theoretic hardness per move. Models navigating sparse, high-difficulty regions show stronger strategic planning.
Path Length vs Difficulty-Weighted Score
Models above the diagonal perform harder moves for a given path length.
Why They Fail
Failure modes, stop reasons, and the patterns that keep models from achieving longer chains.
Chain Patterns
How chains unfold step by step, from exploration drift to edit choices and movement through the graph.
Exploration Drift
Average distance to the previous 20 words in the chain.
Edit Operations
What types of edits models prefer and how they sequence them.
Edit Type Distribution
Share of insertions, deletions, and substitutions in each model's chains.
Bridge Pattern Rate
How often models use the insert-substitute-delete bridge pattern.
Word Length Over Chain Steps
Longest-trial word length by step for each starting word.
Explorer
Browse the full path each model takes for every starting word, word by word through the graph.
In 3D mode, nearby words are close in the word graph: words with shorter graph distance and tighter local connectivity are placed nearer each other. The absolute axis values and overall orientation carry no meaning.
About LisanBench
Code and details, why word chains work, and how the benchmark is scored.
Code & Details
More details, the benchmark implementation, and the source data are available on GitHub.
Why Word Chains?
Words as building blocks. Every language model is familiar with words. By making words the basic unit, evaluation becomes trivial: each answer is checked with simple string matching and a dictionary lookup. There is no LLM judge, no rubric-writing, and no human annotation.
Simple rules, emergent complexity. The problem is easy to state, but difficulty emerges naturally and can be adjusted by changing the starting words. The benchmark also scales effortlessly: you can add more starting words, more trials, or even switch languages without redesigning the task or changing how evaluation works. If you want to isolate reasoning more directly, you can also provide all valid next moves at each step and measure whether the model chooses well.
Why complexity emerges. As stronger models produce longer chains, the task gets harder on its own. They have to track a longer history, recall more accurately which words have already been used, and avoid painting themselves into a corner. Every chosen word removes one future node from the search space, so each move changes the structure of what remains.
Capabilities that matter in practice. Word chains probe a combination of abilities that real-world tasks depend on. To score well, models need to follow instructions, draw on broad and sometimes obscure knowledge, plan ahead strategically, remember what they have already done, and keep going until the task is actually finished:
- Rule Following — stay within the problem's constraints at every step
- Knowledge — know enough words to navigate sparse regions
- Planning — avoid short-sighted moves that lead to dead ends
- Recall — keep track of what has already been used
- Persistence — continue until no valid moves remain
These abilities compound rather than substitute for one another. Weakness in any one of them shows up quickly: models that stop early score poorly, models with weak recall repeat words, models without enough vocabulary get stuck in sparse regions, and models that know many words but plan poorly still fail to escape low-connectivity parts of the graph. That makes LisanBench difficult to game and better aligned with real-world usefulness than benchmarks that can be optimized around narrow tricks.
Methodology & Scoring
Each model is evaluated on the same 50 starting words, chosen to cover a range of difficulty. Some starting words, like hat, sit in dense parts of the graph with many valid continuations. Others, like countries, lie in much sparser regions, where valid moves are scarce and poor choices lead to dead ends more quickly. Each starting word is tested 3 times per model to reduce variance.
All chains are validated programmatically: each word must exist in the dictionary, each transition must be exactly edit-distance 1, and no repeats are allowed.
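The three validation rules can be sketched in a few lines of Python. This is an illustration of the checks described above, not the benchmark's actual implementation; `is_one_edit` and `validate_chain` are hypothetical names:

```python
def is_one_edit(a: str, b: str) -> bool:
    """True iff b differs from a by exactly one insertion, deletion, or substitution."""
    if len(a) > len(b):                 # make a the shorter word
        a, b = b, a
    if len(b) - len(a) > 1 or a == b:
        return False
    for i in range(len(a)):
        if a[i] != b[i]:
            if len(a) == len(b):
                return a[i + 1:] == b[i + 1:]   # substitution at position i
            return a[i:] == b[i + 1:]           # single insertion at position i
    return True                          # b is a plus one trailing character


def validate_chain(chain: list[str], dictionary: set[str]) -> bool:
    """Every word must be in the dictionary, every transition must be
    edit-distance 1, and no word may repeat."""
    return (
        len(set(chain)) == len(chain)
        and all(w in dictionary for w in chain)
        and all(is_one_edit(a, b) for a, b in zip(chain, chain[1:]))
    )
```

For example, `["hat", "hats", "oats"]` passes (one insertion, then one substitution), while `["hat", "bat", "hat"]` fails the no-repeat rule.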
Average Path Length — The number of valid edit-distance-1 transitions from the required starting word, summed across all 50 starting words and averaged over the 3 trials for each word. This gives a stable overall score.
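Concretely, the aggregation might look like the sketch below; the function name and input shape are illustrative, not taken from the benchmark code:

```python
def average_path_length(path_lengths: dict[str, list[int]]) -> float:
    """path_lengths maps each starting word to its per-trial counts of valid
    edit-distance-1 transitions. Average the trials for each starting word,
    then sum across all starting words."""
    return sum(sum(trials) / len(trials) for trials in path_lengths.values())
```

With two starting words and three trials each, `average_path_length({"hat": [10, 12, 14], "countries": [2, 2, 2]})` gives 12 + 2 = 14.0.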
Difficulty-Weighted Score — Each move is weighted by 1/branching_factor × rarity, where branching factor is the number of legal next moves and rarity is based on inverse word frequency. Models that successfully move through sparse, uncommon parts of the word graph score higher, rewarding path quality rather than raw length alone.
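A minimal sketch of this weighting follows. Two assumptions are baked in and may differ from the real scorer: rarity is taken as the plain inverse of a supplied frequency value, and each move is weighted by the word it lands on. The branching factor here counts all dictionary neighbors, without excluding already-used words:

```python
def is_one_edit(a: str, b: str) -> bool:
    # Edit-distance-1 test (one insertion, deletion, or substitution).
    if len(a) > len(b):
        a, b = b, a
    if len(b) - len(a) > 1 or a == b:
        return False
    for i in range(len(a)):
        if a[i] != b[i]:
            return a[i + 1:] == b[i + 1:] if len(a) == len(b) else a[i:] == b[i + 1:]
    return True


def difficulty_weighted_score(chain: list[str],
                              dictionary: set[str],
                              frequency: dict[str, float]) -> float:
    """Each move earns 1/branching_factor * rarity (assumed forms, see above)."""
    score = 0.0
    for word in chain[1:]:
        # Dictionary neighbors of the landing word, a proxy for legal next moves.
        bf = sum(1 for w in dictionary if is_one_edit(word, w))
        rarity = 1.0 / frequency[word]   # assumed inverse-frequency rarity
        score += rarity / max(bf, 1)
    return score
```

In a toy dictionary `{"hat", "bat", "cat", "hats"}`, the move into `"bat"` has branching factor 2 (`"hat"` and `"cat"`), so a rare `"bat"` earns proportionally more than a move of the same length through a dense region.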