Leaderboard
Rankings, trends over time, and the metrics that separate strong models from weak ones.
| # | Model | Company | Average | Weighted |
|---|-------|---------|---------|----------|
Timeline
How the frontier moves over model release dates.
Reasoning Efficiency
Output tokens spent vs score achieved. Models below the frontier are spending tokens without proportional gains.
Chain Length vs Validity
Validity is the fraction of legal moves in a chain. Separates models whose chains are both long and clean from those that trade correctness for length.
Chain Length vs Difficulty
Average graph-theoretic hardness per move. Models navigating sparse, high-difficulty regions show stronger strategic planning.
Chain Length vs Difficulty-Weighted Score
Raw chain length vs the weighted score that rewards harder moves. Models above the diagonal punch above their weight.
Why They Fail
Failure modes, stop reasons, and the patterns that keep models from achieving longer chains.
Chain Patterns
How chains unfold step by step, from exploration drift to edit choices and movement through the graph.
Exploration Drift
Average distance to the previous 20 words in the chain.
Edit Operations
What types of edits models prefer and how they sequence them.
Edit Type Distribution
Share of insertions, deletions, and substitutions in each model's chains.
Bridge Pattern Rate
How often models use the insert-substitute-delete bridge pattern.
Word Length Over Chain Steps
Word length at each step of the longest trial for each starting word.
Explorer
Browse the full path each model takes for every starting word, word by word through the graph.
About LisanBench
Code and details, why word chains work, and how the benchmark is scored.
Code & Details
More details, the benchmark implementation, and the source data are available on GitHub.
Why Word Chains?
Words as building blocks. Every language model is familiar with words. By making words the basic unit, benchmarking costs stay low and evaluation is trivial — no LLM judge, no human annotators. The benchmark scales effortlessly.
Simple rules, emergent complexity. The problem is easy to state, but difficulty emerges naturally and can be adjusted by changing the starting words. This makes it easy to differentiate between strong and weak models. The task is also open-ended — many benchmarks saturate after a few months, but here stronger models can simply keep producing longer chains.
Capabilities that matter in practice. Word chains probe a combination of abilities that real-world tasks depend on. To score well, models need to follow instructions, draw on broad and sometimes obscure knowledge, plan ahead strategically, remember what they have already done, and keep going until the task is actually finished:
- Rule Following — stay within the problem's constraints at every step
- Knowledge — know enough words to navigate sparse regions
- Planning — avoid short-sighted moves that lead to dead ends
- Recall — keep track of what has already been used
- Stamina — continue until no valid moves remain
These abilities compound rather than substitute for one another. Weakness in any one of them shows up quickly: models that stop early score poorly, models with weak recall repeat words, models without enough vocabulary get stuck in sparse regions, and models that know many words but plan poorly still fail to escape low-connectivity parts of the graph. That makes LisanBench difficult to game and better aligned with real-world usefulness than benchmarks that can be optimized around narrow tricks.
Methodology & Scoring
Each model is evaluated on the same 50 starting words, chosen to cover a range of difficulty. Some starting words, like hat, sit in dense parts of the graph with many valid continuations. Others, like countries, lie in much sparser regions, where valid moves are scarce and poor choices lead to dead ends more quickly. Each starting word is tested 3 times per model to reduce variance.
All chains are validated programmatically: each word must exist in the dictionary, each transition must be exactly edit-distance 1, and no repeats are allowed.
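These validation rules can be sketched in a few lines. The function names below are illustrative, not the benchmark's actual implementation; it assumes the dictionary is available as a plain set of words.

```python
def edit_distance_one(a: str, b: str) -> bool:
    """True if b differs from a by exactly one insertion, deletion, or substitution."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        # Substitution: exactly one position differs.
        return sum(x != y for x, y in zip(a, b)) == 1
    if len(a) > len(b):
        a, b = b, a  # make a the shorter word
    # Insertion: deleting one character from b must yield a.
    return any(b[:i] + b[i + 1:] == a for i in range(len(b)))

def validate_chain(chain: list[str], dictionary: set[str]) -> int:
    """Return the number of valid words before the first rule violation."""
    seen: set[str] = set()
    for i, word in enumerate(chain):
        if word not in dictionary or word in seen:
            return i
        if i > 0 and not edit_distance_one(chain[i - 1], word):
            return i
        seen.add(word)
    return len(chain)
```

For example, `validate_chain(["hat", "cat", "cot", "coat"], dictionary)` accepts all four words, while a chain that repeats a word or jumps more than one edit is cut off at the first violation.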
Average Chain Length — The number of valid words produced, summed across all 50 starting words and averaged over the 3 trials for each word. This gives a stable overall score.
Difficulty-Weighted Score — Each move is weighted by 1/branching_factor × rarity, where branching factor is the number of legal next moves and rarity is based on inverse word frequency. Models that successfully move through sparse, uncommon parts of the word graph score higher, rewarding path quality rather than raw length alone.
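The weighting can be illustrated with a minimal sketch. Here `rarity` is modeled as a plain inverse of word frequency and the per-move branching factors are taken as precomputed inputs; both are assumptions for illustration, not the benchmark's exact definitions.

```python
def weighted_score(moves: list[tuple[int, float]]) -> float:
    """Sum of per-move weights, where each move is (branching_factor, word_frequency).

    Each accepted move contributes 1/branching_factor * rarity,
    with rarity modeled here as 1/frequency (an assumed stand-in).
    """
    score = 0.0
    for branching, freq in moves:
        rarity = 1.0 / freq
        score += (1.0 / branching) * rarity
    return score
```

A move with only two legal continuations through a rare word thus counts for far more than a move in a dense neighborhood through a common word, which is what lets path quality diverge from raw length.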