OdysseyBench
Long-horizon agent benchmark requiring sustained reasoning and planning over 50+ steps. Tests whether agents can maintain coherent goals across very long task horizons without losing context.
- Benchmark type: Public benchmark
- Benchmark domain: General reasoning
- Task count: ~200
- Evaluation method: Functional
- Top model score: ~35%
- Human score: N/A
About this benchmark
OdysseyBench is a benchmark for evaluating agents on long-horizon tasks that require sustained reasoning and planning over 50 or more steps. It focuses on assessing whether an agent can maintain a coherent goal across a very long task horizon without losing context, targeting a gap in existing evaluations, which primarily test short, few-step interactions.
Tasks in OdysseyBench require agents to break a high-level goal into many intermediate steps and carry state forward across the full horizon. Evaluation is functional: a task is scored by checking whether the agent's final outcome satisfies the task's success criteria, rather than by matching a specific action sequence or output string. With top model scores around 35% on the benchmark's roughly 200 tasks, substantial headroom remains.
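The functional evaluation described above can be sketched as follows. This is a minimal illustration, not OdysseyBench's actual harness: every name here (`Task`, `check`, `score`) is hypothetical, and the idea shown is simply that each task defines a predicate over the final environment state, so an agent is credited for reaching the right outcome regardless of the path it took.

```python
# Hypothetical sketch of functional (outcome-based) scoring.
# Each task supplies a predicate over the final environment state;
# the benchmark score is the fraction of tasks whose predicate holds.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    name: str
    check: Callable[[dict], bool]  # predicate over the final state

def score(tasks: list[Task], final_states: dict[str, dict]) -> float:
    """Fraction of tasks whose outcome predicate passes."""
    if not tasks:
        return 0.0
    passed = sum(1 for t in tasks if t.check(final_states.get(t.name, {})))
    return passed / len(tasks)

# Example: the first task passes because the desired end state was
# reached; the second fails, however many steps the agent took.
tasks = [
    Task("write_report", lambda s: s.get("report.txt") == "Q3 summary"),
    Task("archive_logs", lambda s: s.get("logs_archived") is True),
]
states = {
    "write_report": {"report.txt": "Q3 summary"},
    "archive_logs": {"logs_archived": False},
}
print(score(tasks, states))  # 0.5
```

The design point is that outcome predicates tolerate many valid solution paths, which matters over 50+ step horizons where no two successful trajectories look alike.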
OdysseyBench is significant because long-horizon agency represents a frontier challenge for AI systems, requiring capabilities that go well beyond single-turn question answering. As agentic models become more prevalent, benchmarks that test sustained goal-directed behavior over many steps become essential for measuring real progress. The benchmark helps distinguish agents that genuinely plan and track state from those relying on short-horizon pattern matching.
Where this benchmark fits
Use this page when you need the benchmark-specific context. For side-by-side comparison, go back to the full registry or open the general reasoning view. You can also jump straight to this benchmark in the master registry list.