Leaderboard
Rankings, release-date trends, efficiency, and the comparison plots that show how models trade off score, validity, and difficulty.
Each model is tested 3 times per starting word. Scores are averaged across all trials.
| # | Model | Company | Path Length | Difficulty-Weighted |
|---|-------|---------|-------------|---------------------|
Timeline
How the frontier moves over model release dates.
Reasoning Efficiency
Output tokens spent vs score achieved. Models below the frontier are spending tokens without proportional gains.
Path Length vs Validity
Validity is the fraction of legal moves in a chain. Separates models whose chains are long and clean from those that trade correctness for length.
Path Length vs Difficulty
Average graph-theoretic hardness per move. Models navigating sparse, high-difficulty regions show stronger strategic planning.
Path Length vs Difficulty-Weighted Score
Models above the diagonal perform harder moves for a given path length.
Why They Fail
Failure modes, stop reasons, and the patterns that keep models from achieving longer chains.
Chain Patterns
How chains unfold step by step, from exploration drift to edit choices and movement through the graph.
Exploration Drift
Average distance to the previous 20 words in the chain.
Edit Operations
What types of edits models prefer and how they sequence them.
Edit Type Distribution
Share of insertions, deletions, and substitutions in each model's chains.
Bridge Pattern Rate
How often models use the insert-substitute-delete bridge pattern.
Word Length Over Chain Steps
Longest-trial word length by step for each starting word.
Explorer
Browse the full path each model takes for every starting word, word by word through the graph.
In 3D mode, nearby words are close in the word graph: words with shorter graph distance and tighter local connectivity are placed nearer each other. The absolute axis values and overall orientation carry no meaning.
About LisanBench
Code and details, why word chains work, and how the benchmark is scored.
Code & Details
More details, the benchmark implementation, and the source data are available on GitHub.
Why Word Chains?
Words as building blocks. Every language model is familiar with words. By making words the basic unit, evaluation becomes trivial: each answer is checked with simple string matching and a dictionary lookup. There is no LLM judge, no rubric-writing, and no human annotation.
Simple rules, emergent complexity. The problem is easy to state, but difficulty emerges naturally and can be adjusted by changing the starting words. The benchmark also scales effortlessly: you can add more starting words, more trials, or even switch languages without redesigning the task or changing how evaluation works. If you want to isolate reasoning more directly, you can also provide all valid next moves at each step and measure whether the model chooses well.
Why complexity emerges. As stronger models produce longer chains, the task gets harder on its own. They have to track a longer history, recall more accurately which words have already been used, and avoid painting themselves into a corner. Every chosen word removes one future node from the search space, so each move changes the structure of what remains.
Capabilities that matter in practice. Word chains probe a combination of abilities that real-world tasks depend on. To score well, models need to follow instructions, draw on broad and sometimes obscure knowledge, plan ahead strategically, remember what they have already done, and keep going until the task is actually finished:
- Rule Following — stay within the problem's constraints at every step
- Knowledge — know enough words to navigate sparse regions
- Planning — avoid short-sighted moves that lead to dead ends
- Recall — keep track of what has already been used
- Persistence — continue until no valid moves remain
These abilities compound rather than substitute for one another. Weakness in any one of them shows up quickly: models that stop early score poorly, models with weak recall repeat words, models without enough vocabulary get stuck in sparse regions, and models that know many words but plan poorly still fail to escape low-connectivity parts of the graph. That makes LisanBench difficult to game and better aligned with real-world usefulness than benchmarks that can be optimized around narrow tricks.
Methodology & Scoring
Each model is evaluated on the same 50 starting words, chosen to cover a range of difficulty. Some starting words, like hat, sit in dense parts of the graph with many valid continuations. Others, like countries, lie in much sparser regions, where valid moves are scarce and poor choices lead to dead ends more quickly. Each starting word is tested 3 times per model to reduce variance.
All chains are validated programmatically: each word must exist in the dictionary, each transition must be exactly edit-distance 1, and no repeats are allowed.
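The three validation rules can be sketched in a few lines of Python. This is an illustration of the checks described above, not the benchmark's actual implementation; `is_one_edit` and `validate_chain` are hypothetical names:

```python
def is_one_edit(a: str, b: str) -> bool:
    """True iff b differs from a by exactly one insertion, deletion, or substitution."""
    if len(a) > len(b):                 # make a the shorter word
        a, b = b, a
    if len(b) - len(a) > 1 or a == b:
        return False
    for i in range(len(a)):
        if a[i] != b[i]:
            if len(a) == len(b):
                return a[i + 1:] == b[i + 1:]   # substitution at position i
            return a[i:] == b[i + 1:]           # single insertion at position i
    return True                          # b is a plus one trailing character


def validate_chain(chain: list[str], dictionary: set[str]) -> bool:
    """Every word must be in the dictionary, every transition must be
    edit-distance 1, and no word may repeat."""
    return (
        len(set(chain)) == len(chain)
        and all(w in dictionary for w in chain)
        and all(is_one_edit(a, b) for a, b in zip(chain, chain[1:]))
    )
```

For example, `["hat", "hats", "oats"]` passes (one insertion, then one substitution), while `["hat", "bat", "hat"]` fails the no-repeat rule.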
Average Path Length — The number of valid edit-distance-1 transitions from the required starting word, summed across all 50 starting words and averaged over the 3 trials for each word. This gives a stable overall score.
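Concretely, the aggregation might look like the sketch below; the function name and input shape are illustrative, not taken from the benchmark code:

```python
def average_path_length(path_lengths: dict[str, list[int]]) -> float:
    """path_lengths maps each starting word to its per-trial counts of valid
    edit-distance-1 transitions. Average the trials for each starting word,
    then sum across all starting words."""
    return sum(sum(trials) / len(trials) for trials in path_lengths.values())
```

With two starting words and three trials each, `average_path_length({"hat": [10, 12, 14], "countries": [2, 2, 2]})` gives 12 + 2 = 14.0.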
Difficulty-Weighted Score — Each move is weighted by 1/branching_factor × rarity, where branching factor is the number of legal next moves and rarity is based on inverse word frequency. Models that successfully move through sparse, uncommon parts of the word graph score higher, rewarding path quality rather than raw length alone.
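A minimal sketch of this weighting follows. Two assumptions are baked in and may differ from the real scorer: rarity is taken as the plain inverse of a supplied frequency value, and each move is weighted by the word it lands on. The branching factor here counts all dictionary neighbors, without excluding already-used words:

```python
def is_one_edit(a: str, b: str) -> bool:
    # Edit-distance-1 test (one insertion, deletion, or substitution).
    if len(a) > len(b):
        a, b = b, a
    if len(b) - len(a) > 1 or a == b:
        return False
    for i in range(len(a)):
        if a[i] != b[i]:
            return a[i + 1:] == b[i + 1:] if len(a) == len(b) else a[i:] == b[i + 1:]
    return True


def difficulty_weighted_score(chain: list[str],
                              dictionary: set[str],
                              frequency: dict[str, float]) -> float:
    """Each move earns 1/branching_factor * rarity (assumed forms, see above)."""
    score = 0.0
    for word in chain[1:]:
        # Dictionary neighbors of the landing word, a proxy for legal next moves.
        bf = sum(1 for w in dictionary if is_one_edit(word, w))
        rarity = 1.0 / frequency[word]   # assumed inverse-frequency rarity
        score += rarity / max(bf, 1)
    return score
```

In a toy dictionary `{"hat", "bat", "cat", "hats"}`, the move into `"bat"` has branching factor 2 (`"hat"` and `"cat"`), so a rare `"bat"` earns proportionally more than a move of the same length through a dense region.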