arxiv
Score 7.8
Why Global LLM Leaderboards Are Misleading: Small Portfolios for Heterogeneous Supervised ML
General AI
Ranking LLMs via pairwise human feedback underpins current leaderboards for open-ended tasks, such as creative writing and problem-solving. We analyze ~89K comparisons in 116 languages across 52 LLMs from Arena and show that the best-fit global Bradley-Terry (BT) ranking is misleading. Nearly 2/3 of the decisive votes c…