Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities

Ziwei Zhou¹, Rui Wang¹, Zuxuan Wu²,*
¹Computation and Artificial Intelligence Innovative College, Fudan University
²Institute of Trustworthy Embodied AI, Fudan University

QA Examples

Examples of Daily-Omni Benchmark

Examples from our Daily-Omni benchmark, showcasing diverse audio-visual reasoning questions.

Generation Pipeline


The Question-Answer generation pipeline of Daily-Omni.

Annotation Detail showing how we align audio and visual events.


Daily-Omni Agent

Baseline model

Architecture of Daily-Omni Agent.

Model Performance

Visual Ablation Case

A case study on the importance of temporal alignment for visual reasoning.

Leaderboard

Methods                 AV Event   Comparative  Context        Event     Inference  Reasoning  30s     60s     Avg
                        Alignment               Understanding  Sequence                        Subset  Subset
Omni-Modal Language Models (With Visual and Audio)
Gemini 2.0 Flash        62.18      73.28        63.73          63.72     76.62      75.43      67.23   68.55   67.84
Daily-Omni (ours)       51.68      68.70        60.10          53.92     78.57      71.43      63.99   59.27   61.82
Gemini 2.0 Flash Lite   55.04      64.89        58.03          54.25     74.03      72.00      62.44   60.00   61.32
Ola (7B)                40.34      61.07        40.41          43.46     63.64      69.71      51.47   49.82   50.71
Qwen2.5-Omni (7B)       44.12      51.15        38.86          40.52     57.79      61.71      46.68   48.36   47.45
Qwen2.5-Omni (3B)       38.66      48.09        33.68          33.99     54.55      44.00      42.35   38.36   40.52
VideoLLaMA2 (7B)        35.71      35.88        35.75          31.70     40.91      34.29      38.02   31.82   35.17
Unified-IO-2 XL (3B)    30.25      30.53        25.39          29.08     33.12      21.71      28.13   28.55   28.32
Unified-IO-2 XXL (8B)   25.63      31.30        26.42          25.82     35.06      29.71      26.74   30.00   28.24
Unified-IO-2 L (1B)     27.31      22.90        26.42          27.78     29.87      29.14      27.67   27.09   27.40
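
For readers who want to work with the leaderboard programmatically, the following is a minimal sketch that ranks the methods by any column and finds the top method per category. The scores are transcribed from the leaderboard entries; the names `COLUMNS`, `ROWS`, `rank_by`, and `best_per_column` are illustrative assumptions and not part of any released Daily-Omni tooling.

```python
# Leaderboard scores transcribed from the Daily-Omni leaderboard.
# COLUMNS, ROWS, rank_by and best_per_column are hypothetical helper names.
COLUMNS = [
    "AV Event Alignment", "Comparative", "Context Understanding",
    "Event Sequence", "Inference", "Reasoning",
    "30s Subset", "60s Subset", "Avg",
]

ROWS = {
    "Gemini 2.0 Flash":      [62.18, 73.28, 63.73, 63.72, 76.62, 75.43, 67.23, 68.55, 67.84],
    "Daily-Omni (ours)":     [51.68, 68.70, 60.10, 53.92, 78.57, 71.43, 63.99, 59.27, 61.82],
    "Gemini 2.0 Flash Lite": [55.04, 64.89, 58.03, 54.25, 74.03, 72.00, 62.44, 60.00, 61.32],
    "Ola (7B)":              [40.34, 61.07, 40.41, 43.46, 63.64, 69.71, 51.47, 49.82, 50.71],
    "Qwen2.5-Omni (7B)":     [44.12, 51.15, 38.86, 40.52, 57.79, 61.71, 46.68, 48.36, 47.45],
    "Qwen2.5-Omni (3B)":     [38.66, 48.09, 33.68, 33.99, 54.55, 44.00, 42.35, 38.36, 40.52],
    "VideoLLaMA2 (7B)":      [35.71, 35.88, 35.75, 31.70, 40.91, 34.29, 38.02, 31.82, 35.17],
    "Unified-IO-2 XL (3B)":  [30.25, 30.53, 25.39, 29.08, 33.12, 21.71, 28.13, 28.55, 28.32],
    "Unified-IO-2 XXL (8B)": [25.63, 31.30, 26.42, 25.82, 35.06, 29.71, 26.74, 30.00, 28.24],
    "Unified-IO-2 L (1B)":   [27.31, 22.90, 26.42, 27.78, 29.87, 29.14, 27.67, 27.09, 27.40],
}


def rank_by(column):
    """Return method names sorted from highest to lowest score in `column`."""
    idx = COLUMNS.index(column)
    return sorted(ROWS, key=lambda name: ROWS[name][idx], reverse=True)


def best_per_column():
    """Map each column name to the method with the highest score in it."""
    return {column: rank_by(column)[0] for column in COLUMNS}


# Print the overall ranking by the Avg column.
avg_idx = COLUMNS.index("Avg")
for place, name in enumerate(rank_by("Avg"), start=1):
    print(f"{place:2d}. {name:<22} {ROWS[name][avg_idx]:.2f}")
```

Calling `best_per_column()` on these numbers shows that Daily-Omni (ours) has the highest Inference score, while Gemini 2.0 Flash leads every other column.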

Citation


@misc{zhou2025dailyomni,
      title={Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities}, 
      author={Ziwei Zhou and Rui Wang and Zuxuan Wu},
      year={2025},
      eprint={2505.17862},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2505.17862}, 
}