Leaderboard
Rankings, trends over time, and the metrics that separate strong models from weak ones.
| # | Model | Company | Average | Weighted |
|---|-------|---------|---------|----------|
Timeline
How the frontier moves over model release dates.
Reasoning Efficiency
Output tokens spent vs score achieved. Models below the frontier are spending tokens without proportional gains.
Chain Length vs Validity
Validity is the fraction of legal moves in a chain. Separates models whose chains are both long and clean from those that trade correctness for length.
Chain Length vs Difficulty
Average graph-theoretic hardness per move. Models navigating sparse, high-difficulty regions show stronger strategic planning.
Chain Length vs Difficulty-Weighted Score
Raw chain length vs the weighted score that rewards harder moves. Models above the diagonal punch above their weight.
Why They Fail
Failure modes, stop reasons, and the patterns that keep models from achieving longer chains.
Chain Patterns
How chains unfold step by step, from exploration drift to edit choices and movement through the graph.
Exploration Drift
Average distance to the previous 20 words in the chain.
Edit Operations
What types of edits models prefer and how they sequence them.
Edit Type Distribution
Share of insertions, deletions, and substitutions in each model's chains.
Bridge Pattern Rate
How often models use the insert-substitute-delete bridge pattern.
Word Length Over Chain Steps
Word length at each step of the longest trial for each starting word.
Explorer
Browse the full path each model takes for every starting word, word by word through the graph.
About LisanBench
Code and details, why word chains work, and how the benchmark is scored.
Code & Details
More details, the benchmark implementation, and the source data are available on GitHub.
Why Word Chains?
Words as building blocks. Every language model is familiar with words. By making words the basic unit, benchmarking costs stay low and evaluation is trivial — no LLM judge, no human annotators. The benchmark scales effortlessly.
Simple rules, emergent complexity. The problem is easy to state, but difficulty emerges naturally and can be adjusted by changing the starting words. This makes it easy to differentiate between strong and weak models. The task is also open-ended — many benchmarks saturate after a few months, but here stronger models can simply keep producing longer chains.
Capabilities that matter in practice. Word chains probe a combination of abilities that real-world tasks depend on. To score well, models need to follow instructions, draw on broad and sometimes obscure knowledge, plan ahead strategically, remember what they have already done, and keep going until the task is actually finished:
- Rule Following — stay within the problem's constraints at every step
- Knowledge — know enough words to navigate sparse regions
- Planning — avoid short-sighted moves that lead to dead ends
- Recall — keep track of what has already been used
- Stamina — continue until no valid moves remain
These abilities compound rather than substitute for one another. Weakness in any one of them shows up quickly: models that stop early score poorly, models with weak recall repeat words, models without enough vocabulary get stuck in sparse regions, and models that know many words but plan poorly still fail to escape low-connectivity parts of the graph. That makes LisanBench difficult to game and better aligned with real-world usefulness than benchmarks that can be optimized around narrow tricks.
Methodology & Scoring
Each model is evaluated on the same 50 starting words, chosen to cover a range of difficulty. Some starting words, like hat, sit in dense parts of the graph with many valid continuations. Others, like countries, lie in much sparser regions, where valid moves are scarce and poor choices lead to dead ends more quickly. Each starting word is tested 3 times per model to reduce variance.
All chains are validated programmatically: each word must exist in the dictionary, each transition must be exactly edit-distance 1, and no repeats are allowed.
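These validation rules can be sketched in a few lines. The function names below are illustrative, not the benchmark's actual implementation; it assumes the dictionary is available as a plain set of words.

```python
def edit_distance_one(a: str, b: str) -> bool:
    """True if b differs from a by exactly one insertion, deletion, or substitution."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        # Substitution: exactly one position differs.
        return sum(x != y for x, y in zip(a, b)) == 1
    if len(a) > len(b):
        a, b = b, a  # make a the shorter word
    # Insertion: deleting one character from b must yield a.
    return any(b[:i] + b[i + 1:] == a for i in range(len(b)))

def validate_chain(chain: list[str], dictionary: set[str]) -> int:
    """Return the number of valid words before the first rule violation."""
    seen: set[str] = set()
    for i, word in enumerate(chain):
        if word not in dictionary or word in seen:
            return i
        if i > 0 and not edit_distance_one(chain[i - 1], word):
            return i
        seen.add(word)
    return len(chain)
```

For example, `validate_chain(["hat", "cat", "cot", "coat"], dictionary)` accepts all four words, while a chain that repeats a word or jumps more than one edit is cut off at the first violation.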
Average Chain Length — The number of valid words produced, summed across all 50 starting words and averaged over the 3 trials for each word. This gives a stable overall score.
Difficulty-Weighted Score — Each move is weighted by 1/branching_factor × rarity, where branching factor is the number of legal next moves and rarity is based on inverse word frequency. Models that successfully move through sparse, uncommon parts of the word graph score higher, rewarding path quality rather than raw length alone.
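The weighting can be illustrated with a minimal sketch. Here `rarity` is modeled as a plain inverse of word frequency and the per-move branching factors are taken as precomputed inputs; both are assumptions for illustration, not the benchmark's exact definitions.

```python
def weighted_score(moves: list[tuple[int, float]]) -> float:
    """Sum of per-move weights, where each move is (branching_factor, word_frequency).

    Each accepted move contributes 1/branching_factor * rarity,
    with rarity modeled here as 1/frequency (an assumed stand-in).
    """
    score = 0.0
    for branching, freq in moves:
        rarity = 1.0 / freq
        score += (1.0 / branching) * rarity
    return score
```

A move with only two legal continuations through a rare word thus counts for far more than a move in a dense neighborhood through a common word, which is what lets path quality diverge from raw length.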