Examples from our Daily-Omni benchmark, showcasing diverse audio-visual reasoning questions.
The Question-Answer generation pipeline of Daily-Omni.
Annotation details showing how audio and visual events are temporally aligned.
Architecture of Daily-Omni Agent.
A case study on the importance of temporal alignment for visual reasoning.
Methods | AV Event Alignment | Comparative | Context Understanding | Event Sequence | Inference | Reasoning | 30s Subset | 60s Subset | Avg |
---|---|---|---|---|---|---|---|---|---|
Omni-Modal Language Models (With Visual and Audio) | | | | | | | | | |
Gemini 2.0 Flash | 62.18 | 73.28 | 63.73 | 63.72 | 76.62 | 75.43 | 67.23 | 68.55 | 67.84 |
Daily-Omni (ours) | 51.68 | 68.70 | 60.10 | 53.92 | 78.57 | 71.43 | 63.99 | 59.27 | 61.82 |
Gemini 2.0 Flash Lite | 55.04 | 64.89 | 58.03 | 54.25 | 74.03 | 72.00 | 62.44 | 60.00 | 61.32 |
Ola (7B) | 40.34 | 61.07 | 40.41 | 43.46 | 63.64 | 69.71 | 51.47 | 49.82 | 50.71 |
Qwen2.5-Omni (7B) | 44.12 | 51.15 | 38.86 | 40.52 | 57.79 | 61.71 | 46.68 | 48.36 | 47.45 |
Qwen2.5-Omni (3B) | 38.66 | 48.09 | 33.68 | 33.99 | 54.55 | 44.00 | 42.35 | 38.36 | 40.52 |
VideoLLaMA2 (7B) | 35.71 | 35.88 | 35.75 | 31.70 | 40.91 | 34.29 | 38.02 | 31.82 | 35.17 |
Unified-IO-2 XL (3B) | 30.25 | 30.53 | 25.39 | 29.08 | 33.12 | 21.71 | 28.13 | 28.55 | 28.32 |
Unified-IO-2 XXL (8B) | 25.63 | 31.30 | 26.42 | 25.82 | 35.06 | 29.71 | 26.74 | 30.00 | 28.24 |
Unified-IO-2 L (1B) | 27.31 | 22.90 | 26.42 | 27.78 | 29.87 | 29.14 | 27.67 | 27.09 | 27.40 |
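A minimal sketch of how the table above might be compared programmatically: the dictionary below copies each model's overall average accuracy (the Avg column) from the table and ranks the models. The variable names are illustrative, not part of the Daily-Omni codebase.

```python
# Avg-column scores copied from the Daily-Omni results table above.
avg_scores = {
    "Gemini 2.0 Flash": 67.84,
    "Daily-Omni (ours)": 61.82,
    "Gemini 2.0 Flash Lite": 61.32,
    "Ola (7B)": 50.71,
    "Qwen2.5-Omni (7B)": 47.45,
    "Qwen2.5-Omni (3B)": 40.52,
    "VideoLLaMA2 (7B)": 35.17,
    "Unified-IO-2 XL (3B)": 28.32,
    "Unified-IO-2 XXL (8B)": 28.24,
    "Unified-IO-2 L (1B)": 27.40,
}

# Rank models from best to worst by average accuracy.
ranked = sorted(avg_scores.items(), key=lambda kv: kv[1], reverse=True)
for model, avg in ranked:
    print(f"{model}: {avg:.2f}")
```

Note that the Avg column is not the simple mean of the 30s and 60s subset scores (e.g. Gemini 2.0 Flash: (67.23 + 68.55) / 2 = 67.89 vs. a reported 67.84), suggesting it is weighted by per-subset question counts.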
@misc{zhou2025dailyomni,
title={Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities},
author={Ziwei Zhou and Rui Wang and Zuxuan Wu},
year={2025},
eprint={2505.17862},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2505.17862},
}