LisanBench

How good are LLMs at basic reasoning?

LisanBench asks models to build the longest chain of non-repeating English words where each word differs from the last by exactly one letter (one insertion, deletion, or substitution). It measures rule-following, knowledge, planning, recall and persistence.

Example: love → (insert r) → lover → (substitute l) → cover → (delete r) → cove → (invalid) → cure
Path Length = 3: the number of valid transitions from the required starting word.
Difficulty-Weighted = 0.52: common words with many possible next words are given less weight.
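The path-length rule in the example can be sketched in a few lines of Python. This is only an illustration, not the benchmark's own scorer: the function names, the tiny `words` set, and the choice to stop counting at the first invalid move are all assumptions made for this sketch.

```python
def is_one_edit(a: str, b: str) -> bool:
    """True if b differs from a by one insertion, deletion, or substitution."""
    if a == b:
        return False
    if len(a) == len(b):
        # Same length: exactly one position may differ (substitution).
        return sum(x != y for x, y in zip(a, b)) == 1
    if abs(len(a) - len(b)) != 1:
        return False
    short, long_ = (a, b) if len(a) < len(b) else (b, a)
    # Dropping one character from the longer word must give the shorter one.
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))


def path_length(chain: list[str], dictionary: set[str]) -> int:
    """Count valid transitions from the starting word.

    Assumption: counting stops at the first invalid move.
    """
    if not chain:
        return 0
    seen = {chain[0]}
    count = 0
    for prev, word in zip(chain, chain[1:]):
        if word in seen or word not in dictionary or not is_one_edit(prev, word):
            break
        seen.add(word)
        count += 1
    return count


# Hypothetical mini-dictionary covering only the example above.
words = {"love", "lover", "cover", "cove", "cure"}
print(path_length(["love", "lover", "cover", "cove", "cure"], words))  # 3
```

The last move, cove to cure, changes two letters, so it is rejected and the chain scores 3.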

Leaderboard

Rankings, release-date trends, efficiency, and the comparison plots that show how models trade off score, validity, and difficulty.


Each model is tested 3 times per starting word. Scores are averaged across all trials.
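As a rough sketch of that averaging, assuming each trial is stored as a (model, starting word, trial index, score) record. The record layout, model names, and scores below are made up for illustration and are not leaderboard data.

```python
from collections import defaultdict

# Hypothetical trial records: (model, starting_word, trial_index, path_length).
trials = [
    ("model-a", "love", 0, 3), ("model-a", "love", 1, 5), ("model-a", "love", 2, 4),
    ("model-a", "hate", 0, 2), ("model-a", "hate", 1, 2), ("model-a", "hate", 2, 2),
    ("model-b", "love", 0, 1), ("model-b", "love", 1, 0), ("model-b", "love", 2, 2),
    ("model-b", "hate", 0, 6), ("model-b", "hate", 1, 6), ("model-b", "hate", 2, 6),
]


def average_scores(records):
    """Average each model's score over every trial of every starting word."""
    by_model = defaultdict(list)
    for model, _word, _trial, score in records:
        by_model[model].append(score)
    return {model: sum(s) / len(s) for model, s in by_model.items()}


print(average_scores(trials))  # {'model-a': 3.0, 'model-b': 3.5}
```

Because every starting word contributes the same number of trials, the flat mean over all trials equals the mean of the per-word means.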

# Model Company Path Length Difficulty-Weighted

Timeline

How the frontier moves over model release dates.


Reasoning Efficiency

Output tokens spent vs score achieved. Models below the frontier are spending tokens without proportional gains.
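The page does not say how its frontier is computed. One common reading is a Pareto frontier: a model is on it if no other model reaches at least the same score with fewer or equal tokens. The sketch below assumes that reading; the function name and the sample points are hypothetical.

```python
def efficiency_frontier(points):
    """Return names of models on the token/score Pareto frontier.

    points: iterable of (name, output_tokens, score).
    A model is kept if its score beats every model that uses
    no more tokens than it does.
    """
    frontier = []
    best_score = float("-inf")
    # Sort by tokens ascending; on ties, put the higher score first.
    for name, _tokens, score in sorted(points, key=lambda p: (p[1], -p[2])):
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier


# Hypothetical (name, output tokens, score) points.
models = [("a", 1000, 50), ("b", 2000, 40), ("c", 3000, 90), ("d", 3000, 80)]
print(efficiency_frontier(models))  # ['a', 'c']
```

Here "b" spends more tokens than "a" for a lower score, and "d" is beaten by "c" at the same token count, so both fall below the frontier.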


Path Length vs Validity

Validity is the fraction of legal moves in a chain. This plot separates models that produce long, clean chains from those that trade correctness for length.
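A minimal sketch of validity as a ratio, assuming a move is legal when the new word is in the dictionary, has not appeared earlier in the chain, and is one edit away from the word just before it. These criteria, the helper names, and the small `words` set are assumptions for illustration.

```python
def one_edit(a: str, b: str) -> bool:
    """True if a and b differ by one insertion, deletion, or substitution."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    short, long_ = sorted((a, b), key=len)
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))


def validity(chain: list[str], dictionary: set[str]) -> float:
    """Fraction of moves in the chain that are legal (0.0 for a chain with no moves)."""
    moves = list(zip(chain, chain[1:]))
    if not moves:
        return 0.0
    seen = {chain[0]}
    legal = 0
    for prev, word in moves:
        if word not in seen and word in dictionary and one_edit(prev, word):
            legal += 1
        seen.add(word)  # Any emitted word counts as used, legal or not.
    return legal / len(moves)


# Hypothetical mini-dictionary.
words = {"love", "lover", "cover", "cove", "cure"}
print(validity(["love", "lover", "cover", "cove", "cure"], words))  # 0.75
```

Three of the four moves are legal, so this chain has a validity of 0.75.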


Path Length vs Difficulty

Difficulty is the average graph-theoretic hardness per move. Models that navigate sparse, high-difficulty regions show stronger strategic planning.


Path Length vs Difficulty-Weighted Score

Models above the diagonal make harder moves for a given path length.