LisanBench

How good are LLMs at basic reasoning?

LisanBench measures rule-following, knowledge, planning, recall and stamina in a single task: build the longest chain of non-repeating English words where each word differs from the last by exactly one letter.

Example: share → shore → store → score → …
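
The rule can be checked mechanically. The sketch below is illustrative only and is not the benchmark's scorer: it assumes "differs by exactly one letter" means a single substitution between equal-length words (as in the example chain), and that a chain is scored up to its first violation. The function names and the tiny word set are hypothetical.

```python
def differs_by_one_letter(a: str, b: str) -> bool:
    """True if a and b have equal length and differ in exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def valid_prefix_length(chain: list[str], dictionary: set[str]) -> int:
    """Count the words in the chain before the first rule violation.

    Assumed violations: a word outside the dictionary, a repeated word,
    or a word that is not one substitution away from its predecessor.
    """
    seen: set[str] = set()
    for i, word in enumerate(chain):
        if word not in dictionary or word in seen:
            return i
        if i > 0 and not differs_by_one_letter(chain[i - 1], word):
            return i
        seen.add(word)
    return len(chain)


# Hypothetical mini-dictionary, for illustration only.
words = {"share", "shore", "store", "score", "stare"}
print(valid_prefix_length(["share", "shore", "store", "score"], words))  # 4
print(valid_prefix_length(["share", "shore", "share"], words))           # 2 (repeat)
```

If the benchmark also accepts insertions or deletions, `differs_by_one_letter` would need to be swapped for an edit-distance check.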

Leaderboard

Rankings, trends over time, and the metrics that separate strong models from weak ones.

Columns: #, Model, Company, Average, Weighted

Timeline

How the frontier moves over model release dates.


Reasoning Efficiency

Output tokens spent vs score achieved. Models below the frontier are spending tokens without proportional gains.
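
One way to read "the frontier" is as a Pareto frontier over (output tokens, score): a model is on it if no other model scores higher while spending the same number of tokens or fewer. The sketch below works under that assumption; the page does not state how its frontier is drawn, and the model names and numbers are placeholders, not benchmark results.

```python
def efficiency_frontier(points: dict[str, tuple[float, float]]) -> list[str]:
    """Return the names on the token/score Pareto frontier, cheapest first.

    points maps a model name to (output_tokens, score).
    """
    frontier: list[str] = []
    best_score = float("-inf")
    # Walk models from cheapest to most expensive; on a token tie,
    # consider the higher score first.
    ordered = sorted(points.items(), key=lambda kv: (kv[1][0], -kv[1][1]))
    for name, (_tokens, score) in ordered:
        # A model joins the frontier only if it beats every cheaper model.
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier


# Placeholder values for illustration only.
runs = {
    "model_a": (1_000, 50.0),
    "model_b": (2_000, 40.0),   # costs more than model_a, scores less
    "model_c": (3_000, 120.0),
    "model_d": (8_000, 110.0),  # costs more than model_c, scores less
}
print(efficiency_frontier(runs))  # ['model_a', 'model_c']
```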


Chain Length vs Validity

Validity is the fraction of legal moves in a chain. Separates models whose chains are long and clean from those that trade correctness for length.
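
As a rough sketch of that fraction: count each step from one word to the next as a move, and call it legal if the new word is in the dictionary, has not appeared before, and is one substitution away from the previous word. These legality conditions and the helper names are assumptions for illustration; the benchmark's exact accounting may differ.

```python
def is_one_substitution(a: str, b: str) -> bool:
    """True if a and b have equal length and differ in exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def validity(chain: list[str], dictionary: set[str]) -> float:
    """Fraction of moves in the chain that are legal (0.0 if there are no moves)."""
    if len(chain) < 2:
        return 0.0  # arbitrary convention for a chain with no moves
    seen = {chain[0]}
    legal = 0
    for prev, word in zip(chain, chain[1:]):
        if word in dictionary and word not in seen and is_one_substitution(prev, word):
            legal += 1
        seen.add(word)
    return legal / (len(chain) - 1)


# Hypothetical mini-dictionary, for illustration only.
words = {"share", "shore", "shone", "stare"}
chain = ["share", "shore", "shone", "share", "stare"]
# Moves: share→shore legal, shore→shone legal, shone→share illegal (repeat),
# share→stare legal.
print(validity(chain, words))  # 0.75
```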


Chain Length vs Difficulty

Average graph-theoretic hardness per move. Models navigating sparse, high-difficulty regions show stronger strategic planning.
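
The page does not define its hardness measure. As one plausible proxy, the sketch below builds the word graph (words are nodes, one-substitution pairs are edges) and treats a word with few neighbors as hard to move through, scoring each move as 1 divided by the degree of the word it lands on. Both the proxy and the function names are assumptions, not the benchmark's definition.

```python
from collections import defaultdict


def build_neighbors(words: set[str]) -> dict[str, set[str]]:
    """Map each word to the words exactly one substitution away.

    Words are grouped into wildcard buckets such as "sh_re"; any two
    words sharing a bucket differ only at the wildcard position.
    """
    buckets: dict[str, set[str]] = defaultdict(set)
    for w in words:
        for i in range(len(w)):
            buckets[w[:i] + "_" + w[i + 1:]].add(w)
    neighbors: dict[str, set[str]] = {w: set() for w in words}
    for group in buckets.values():
        for w in group:
            neighbors[w] |= group - {w}
    return neighbors


def mean_difficulty(chain: list[str], neighbors: dict[str, set[str]]) -> float:
    """Average of 1/degree over the words reached by each move.

    Assumes the chain is valid, so every reached word has at least one neighbor.
    """
    reached = chain[1:]
    if not reached:
        return 0.0
    return sum(1 / len(neighbors[w]) for w in reached) / len(reached)


# Hypothetical mini-dictionary, for illustration only.
words = {"share", "shore", "store", "score", "stare", "spare", "scare"}
graph = build_neighbors(words)
print(sorted(graph["shore"]))  # ['score', 'share', 'store']
print(mean_difficulty(["share", "shore", "store", "score"], graph))
```

Under this proxy a chain that stays in dense neighborhoods scores low, while one that threads through sparsely connected words scores high.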


Chain Length vs Difficulty-Weighted Score

Raw chain length vs the weighted score that rewards harder moves. Models above the diagonal punch above their weight.