Specialized agent benchmark - Public

Sotopia

Social intelligence benchmark placing agents in realistic social scenarios. Evaluates believability, social goal completion, relationship management, and secret keeping across seven social dimensions.

BENCHMARK
Benchmark by Sotopia Lab
Benchmark type:
Public benchmark
Benchmark domain:
Specialized agent
Task count:
~600 episodes
Evaluation method:
LLM judge (GPT-4)
Top model score
~7.6/10
GPT-4
OpenAI
Human score
~8.3/10

About this benchmark

SOTOPIA is an open-ended environment for evaluating social intelligence in language agents, created by researchers at CMU and presented as a spotlight at ICLR 2024. It contains 90 social scenarios spanning negotiation, collaboration, competition, and exchange, each paired with 10 role-playing agent combinations. Agents interact in multi-turn dialogues pursuing complex social goals while a dedicated evaluation framework, SOTOPIA-Eval, scores performance across seven dimensions drawn from social psychology, economics, and cognitive science, including goal completion, relationship building, believability, knowledge, and financial outcomes.

Evaluation uses both GPT-4 as an automated judge and human annotators, with scores on scales varying by dimension (e.g., 0-10 for goal completion, -5 to 5 for relationships). A challenging subset, SOTOPIA-hard, comprises 14 scenarios where all models struggle. On this subset, GPT-4 achieves significantly lower goal completion than humans and fails to demonstrate social commonsense reasoning and strategic communication. GPT-4 automated judgments correlate well with human ratings on several criteria.
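Because the dimensions use different scales, comparing or combining them requires normalization. The sketch below is illustrative only, not the official SOTOPIA-Eval pipeline: the dimension names and the simple mean aggregation are assumptions, while the score ranges (0-10 for goal completion, -5 to 5 for relationships) come from the description above.

```python
# Hedged sketch: putting SOTOPIA-Eval-style dimension scores on a
# common scale. The aggregation (plain mean of normalized scores)
# is an assumption for illustration, not the benchmark's method.

# Hypothetical per-dimension (min, max) ranges; the first two match
# the scales described in the text.
DIMENSION_RANGES = {
    "goal_completion": (0, 10),
    "relationship": (-5, 5),
    "believability": (0, 10),
}

def normalize(score: float, lo: float, hi: float) -> float:
    """Map a raw score from [lo, hi] onto a 0-10 scale."""
    return 10 * (score - lo) / (hi - lo)

def aggregate(scores: dict) -> float:
    """Average the normalized dimension scores for one episode."""
    vals = [normalize(v, *DIMENSION_RANGES[d]) for d, v in scores.items()]
    return sum(vals) / len(vals)

# Example episode: relationship score of 2 on the -5..5 scale
# normalizes to 7.0 on the 0-10 scale.
episode = {"goal_completion": 7, "relationship": 2, "believability": 9}
print(round(aggregate(episode), 2))  # → 7.67
```

A per-dimension breakdown, rather than a single aggregate, is what the benchmark reports; the mean here is only to show why normalization matters when scales differ.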

SOTOPIA stands out as one of the few benchmarks targeting social intelligence rather than task completion or coding ability. The full platform, dataset, and evaluation code are publicly available on GitHub and Hugging Face under open-source licensing, with an interactive demo at sotopia.world.

Where this benchmark fits

Use this page when you need the benchmark-specific context. For side-by-side comparison, go back to the full registry or open the specialized view. You can also jump straight to this benchmark in the master registry list.