Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities

Ziwei Zhou1 Rui Wang1 Zuxuan Wu2,* Yu-Gang Jiang2,*
1Computation and Artificial Intelligence Innovative College, Fudan University
2Institute of Trustworthy Embodied AI, Fudan University
Paper GitHub 🤗 Dataset 🏆 Leaderboard

QA Examples


Examples from our Daily-Omni benchmark, showcasing diverse audio-visual reasoning questions.

Generation Pipeline


The Question-Answer generation pipeline of Daily-Omni.

Annotation Detail showing how we align audio and visual events.


Daily-Omni Agent


Architecture of Daily-Omni Agent.

Failure Cases


We showcase three diverse scenarios: (1) AV Temporal Alignment in a product review, (2) Cross-modal Reasoning in a vlog, and (3) Logical Inference in an educational video. Correct answers are highlighted in blue; each model's prediction is marked with a green checkmark (correct) or a red cross (incorrect), illustrating the persistent challenges in fine-grained multimodal understanding.

Leaderboard

Performance comparison of MLLMs on Daily-Omni. All questions are four-choice, so random-guess accuracy is 25%.

Abbreviations: AV Align = audio-visual alignment, Comp. = comparative, Ctx. Und. = context understanding, Evt. Seq. = event sequence, Infer. = inference, Reas. = reasoning.

Closed-source models are marked with (Closed) and open-source models with (Open).
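In the modality-ablation tables below (Visual Only and Audio Only), the trailing Δ value for each model matches the change in its overall Avg relative to the full audio-visual setting. A minimal sketch of that reading, using values taken from the tables:

```python
# How the ablation deltas below are read: the signed change in overall
# Avg when a modality is removed, relative to the full AV setting.
def modality_drop(avg_full: float, avg_ablated: float) -> float:
    """Signed change in Avg (negative = performance drop), rounded to 0.1."""
    return round(avg_ablated - avg_full, 1)

# Qwen3-Omni-30B-A3B-Instruct: full AV Avg = 71.85
print(modality_drop(71.85, 57.81))  # visual only -> -14.0
print(modality_drop(71.85, 60.99))  # audio only  -> -10.9
```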

Omni-Modal Language Models (With Visual and Audio)

| Methods | AV Align | Comp. | Ctx. Und. | Evt. Seq. | Infer. | Reas. | 30s | 60s | Avg |
|---|---|---|---|---|---|---|---|---|---|
| NVIDIA Nemotron 3 Nano Omni 30B A3B (Open) | 67.65 | 83.21 | 65.80 | 73.53 | 83.77 | 80.57 | 74.81 | 74.18 | 74.52 |
| Qwen3-Omni-30B-A3B-Thinking (Open) | 65.97 | 80.92 | 65.80 | 71.57 | 85.06 | 80.57 | 76.04 | 70.73 | 73.60 |
| Gemini 2.5 Flash (Closed) | 73.82 | 66.41 | 72.04 | 68.03 | 78.67 | 81.87 | 69.86 | 77.09 | 73.06 |
| Qwen3-Omni-30B-A3B-Instruct (Open) | 66.81 | 80.92 | 64.77 | 66.34 | 81.17 | 81.14 | 71.87 | 71.82 | 71.85 |
| Gemini 2.0 Flash (Closed) | 62.18 | 73.28 | 63.73 | 63.72 | 76.62 | 75.43 | 67.23 | 68.55 | 67.84 |
| Qwen2.5-Omni-7B-Instruct (Open) | 48.32 | 69.47 | 58.55 | 58.17 | 76.62 | 73.14 | 64.61 | 59.09 | 62.07 |
| Gemini 2.5 Flash Lite (Closed) | 57.56 | 68.70 | 56.48 | 52.61 | 79.22 | 69.71 | 63.52 | 60.00 | 61.90 |
| Daily-Omni-Baseline-Qwen2.5 (Open) | 51.68 | 68.70 | 60.10 | 53.92 | 78.57 | 71.43 | 63.99 | 59.27 | 61.82 |
| Gemini 2.0 Flash Lite (Closed) | 55.04 | 64.89 | 58.03 | 54.25 | 74.03 | 72.00 | 62.44 | 60.00 | 61.32 |
| Qwen2.5-Omni-3B-Instruct (Open) | 50.84 | 69.47 | 53.89 | 53.92 | 75.97 | 70.29 | 62.60 | 57.45 | 60.23 |
| Ola (7B) (Open) | 40.34 | 61.07 | 40.41 | 43.46 | 63.64 | 69.71 | 51.47 | 49.82 | 50.71 |
| VideoLLaMA2 (7B) (Open) | 35.71 | 35.88 | 35.75 | 31.70 | 40.91 | 34.29 | 38.02 | 31.82 | 35.17 |
| Unified-IO-2 XL (3B) (Open) | 30.25 | 30.53 | 25.39 | 29.08 | 33.12 | 21.71 | 28.13 | 28.55 | 28.32 |
| Unified-IO-2 XXL (8B) (Open) | 25.63 | 31.30 | 26.42 | 25.82 | 35.06 | 29.71 | 26.74 | 30.00 | 28.24 |
| Unified-IO-2 L (1B) (Open) | 27.31 | 22.90 | 26.42 | 27.78 | 29.87 | 29.14 | 27.67 | 27.09 | 27.40 |

Omni-Modal Language Models (Visual Only)

| Methods | AV Align | Comp. | Ctx. Und. | Evt. Seq. | Infer. | Reas. | 30s | 60s | Avg | Δ vs. AV |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-Omni-30B-A3B-Instruct (Open) | 47.90 | 67.18 | 56.48 | 53.92 | 70.78 | 61.14 | 57.96 | 57.64 | 57.81 | -14.0 |
| Qwen3-Omni-30B-A3B-Thinking (Open) | 44.96 | 64.89 | 55.96 | 59.48 | 67.53 | 60.00 | 56.26 | 59.45 | 57.73 | -15.9 |
| Gemini 2.0 Flash (Closed) | 39.08 | 64.12 | 56.48 | 56.21 | 67.53 | 62.29 | 56.57 | 55.45 | 56.06 | -11.8 |
| Gemini 2.0 Flash Lite (Closed) | 43.70 | 58.02 | 53.89 | 45.10 | 64.29 | 60.57 | 53.01 | 51.64 | 52.38 | -8.9 |
| Qwen2.5-Omni-7B-Instruct (Open) | 34.45 | 58.78 | 47.67 | 49.67 | 62.99 | 54.86 | 48.69 | 51.09 | 49.79 | -12.3 |
| Qwen2.5-Omni-3B-Instruct (Open) | 37.39 | 51.91 | 44.56 | 41.18 | 64.29 | 48.00 | 46.52 | 45.64 | 46.12 | -14.1 |
| Gemini 2.5 Flash (Closed) | 37.55 | 37.21 | 40.43 | 44.78 | 57.05 | 53.29 | 42.35 | 47.46 | 44.61 | -28.5 |
| Gemini 2.5 Flash Lite (Closed) | 36.97 | 45.80 | 37.31 | 39.54 | 59.74 | 47.43 | 44.67 | 41.27 | 43.11 | -18.8 |

Omni-Modal Language Models (Audio Only)

| Methods | AV Align | Comp. | Ctx. Und. | Evt. Seq. | Infer. | Reas. | 30s | 60s | Avg | Δ vs. AV |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-Omni-30B-A3B-Instruct (Open) | 54.20 | 69.47 | 51.81 | 51.63 | 74.03 | 78.86 | 63.37 | 58.18 | 60.99 | -10.9 |
| Qwen3-Omni-30B-A3B-Thinking (Open) | 54.62 | 67.94 | 49.22 | 51.31 | 77.27 | 77.71 | 65.22 | 55.27 | 60.65 | -13.0 |
| Gemini 2.5 Flash (Closed) | 46.64 | 55.73 | 44.56 | 42.48 | 70.78 | 78.86 | 55.64 | 52.18 | 54.05 | -19.0 |
| Gemini 2.5 Flash Lite (Closed) | 42.02 | 61.83 | 41.97 | 45.10 | 68.83 | 65.14 | 54.25 | 48.91 | 51.80 | -10.1 |

Visual Language Models (Visual Only)

| Methods | AV Align | Comp. | Ctx. Und. | Evt. Seq. | Infer. | Reas. | 30s | 60s | Avg |
|---|---|---|---|---|---|---|---|---|---|
| Qwen3-VL-30B-A3B-Instruct (Open) | 47.48 | 68.70 | 52.33 | 55.88 | 67.53 | 61.14 | 57.34 | 57.27 | 57.31 |
| Qwen3-VL-8B-Instruct (Open) | 44.54 | 63.36 | 50.78 | 59.80 | 69.48 | 58.86 | 56.41 | 57.27 | 56.81 |
| GPT-4o (Closed) | 47.90 | 62.60 | 52.33 | 52.61 | 66.23 | 66.29 | 55.64 | 57.45 | 56.47 |
| Qwen3-VL-4B-Instruct (Open) | 43.70 | 61.07 | 54.40 | 53.27 | 68.18 | 58.86 | 54.40 | 56.00 | 55.14 |
| Qwen2.5-VL-7B-Instruct (Open) | 36.97 | 46.56 | 33.68 | 37.91 | 51.95 | 44.00 | 39.26 | 42.36 | 40.68 |
| Qwen2.5-VL-3B-Instruct (Open) | 35.71 | 43.51 | 34.72 | 33.66 | 43.51 | 39.43 | 37.71 | 37.09 | 37.43 |

Audio Language Models (Audio Only)

| Methods | AV Align | Comp. | Ctx. Und. | Evt. Seq. | Infer. | Reas. | 30s | 60s | Avg |
|---|---|---|---|---|---|---|---|---|---|
| Audio Flamingo 3 (7B) (Open) | 40.76 | 55.73 | 43.01 | 40.52 | 65.58 | 68.00 | 50.23 | 49.45 | 49.87 |
| Qwen2-Audio (7B) (Open) | 28.99 | 35.88 | 27.46 | 32.03 | 33.77 | 33.14 | 31.22 | 31.82 | 31.50 |

Textual Language Models (Without Visual and Audio)

| Methods | AV Align | Comp. | Ctx. Und. | Evt. Seq. | Infer. | Reas. | 30s | 60s | Avg |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4o (Closed) | 33.19 | 43.51 | 28.50 | 30.39 | 44.81 | 46.86 | 36.48 | 36.18 | 36.34 |
| Deepseek-V3 (671B) (Closed) | 31.93 | 41.22 | 29.02 | 29.41 | 44.81 | 46.29 | 35.24 | 36.00 | 35.59 |
| Qwen2.5-14B-Instruct (Open) | 30.25 | 39.69 | 27.98 | 28.43 | 42.21 | 42.86 | 32.15 | 35.82 | 33.83 |

Citation


@misc{zhou2025dailyomni,
      title={Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities}, 
      author={Ziwei Zhou and Rui Wang and Zuxuan Wu},
      year={2025},
      eprint={2505.17862},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2505.17862}, 
}