Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities

Ziwei Zhou1 Rui Wang1 Zuxuan Wu2,* Yu-Gang Jiang2,*
1Computation and Artificial Intelligence Innovative College, Fudan University
2Institute of Trustworthy Embodied AI, Fudan University
Paper GitHub 🤗 Dataset 🏆 Leaderboard

QA Examples


Examples from our Daily-Omni benchmark, showcasing diverse audio-visual reasoning questions.

Generation Pipeline


The Question-Answer generation pipeline of Daily-Omni.

Annotation Detail showing how we align audio and visual events.


Daily-Omni Agent


Architecture of Daily-Omni Agent.

Failure Cases


We showcase three diverse scenarios: (1) AV Temporal Alignment in a product review, (2) Cross-modal Reasoning in a vlog, and (3) Logical Inference in an educational video. Correct answers are highlighted in blue; each model's prediction is marked with a green checkmark (correct) or a red cross (incorrect), illustrating the persistent challenges in fine-grained multimodal understanding.

Leaderboard

Performance comparison of MLLMs on Daily-Omni. All questions are four-choice, so random-guess accuracy is 25%.

Abbreviations: AV Align = audio-visual alignment, Comp. = comparative, Ctx. Und. = context understanding, Evt. Seq. = event sequence, Infer. = inference, Reas. = reasoning.

Closed-source models are marked with (Closed) and open-source models with (Open).
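In the modality-ablation tables below (Visual Only and Audio Only), the trailing Δ value for each model matches the change in its overall Avg relative to the full audio-visual setting. A minimal sketch of that reading, using values taken from the tables:

```python
# How the ablation deltas below are read: the signed change in overall
# Avg when a modality is removed, relative to the full AV setting.
def modality_drop(avg_full: float, avg_ablated: float) -> float:
    """Signed change in Avg (negative = performance drop), rounded to 0.1."""
    return round(avg_ablated - avg_full, 1)

# Qwen3-Omni-30B-A3B-Instruct: full AV Avg = 71.85
print(modality_drop(71.85, 57.81))  # visual only -> -14.0
print(modality_drop(71.85, 60.99))  # audio only  -> -10.9
```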

Omni-Modal Language Models (With Visual and Audio)

| Methods | AV Align | Comp. | Ctx. Und. | Evt. Seq. | Infer. | Reas. | 30s | 60s | Avg |
|---|---|---|---|---|---|---|---|---|---|
| NVIDIA Nemotron 3 Nano Omni 30B A3B (Open) | 67.65 | 83.21 | 65.80 | 73.53 | 83.77 | 80.57 | 74.81 | 74.18 | 74.52 |
| Qwen3-Omni-30B-A3B-Thinking (Open) | 65.97 | 80.92 | 65.80 | 71.57 | 85.06 | 80.57 | 76.04 | 70.73 | 73.60 |
| Gemini 2.5 Flash (Closed) | 73.82 | 66.41 | 72.04 | 68.03 | 78.67 | 81.87 | 69.86 | 77.09 | 73.06 |
| Qwen3-Omni-30B-A3B-Instruct (Open) | 66.81 | 80.92 | 64.77 | 66.34 | 81.17 | 81.14 | 71.87 | 71.82 | 71.85 |
| Gemini 2.0 Flash (Closed) | 62.18 | 73.28 | 63.73 | 63.72 | 76.62 | 75.43 | 67.23 | 68.55 | 67.84 |
| Qwen2.5-Omni-7B-Instruct (Open) | 48.32 | 69.47 | 58.55 | 58.17 | 76.62 | 73.14 | 64.61 | 59.09 | 62.07 |
| Gemini 2.5 Flash Lite (Closed) | 57.56 | 68.70 | 56.48 | 52.61 | 79.22 | 69.71 | 63.52 | 60.00 | 61.90 |
| Daily-Omni-Baseline-Qwen2.5 (Open) | 51.68 | 68.70 | 60.10 | 53.92 | 78.57 | 71.43 | 63.99 | 59.27 | 61.82 |
| Gemini 2.0 Flash Lite (Closed) | 55.04 | 64.89 | 58.03 | 54.25 | 74.03 | 72.00 | 62.44 | 60.00 | 61.32 |
| Qwen2.5-Omni-3B-Instruct (Open) | 50.84 | 69.47 | 53.89 | 53.92 | 75.97 | 70.29 | 62.60 | 57.45 | 60.23 |
| Ola (7B) (Open) | 40.34 | 61.07 | 40.41 | 43.46 | 63.64 | 69.71 | 51.47 | 49.82 | 50.71 |
| VideoLLaMA2 (7B) (Open) | 35.71 | 35.88 | 35.75 | 31.70 | 40.91 | 34.29 | 38.02 | 31.82 | 35.17 |
| Unified-IO-2 XL (3B) (Open) | 30.25 | 30.53 | 25.39 | 29.08 | 33.12 | 21.71 | 28.13 | 28.55 | 28.32 |
| Unified-IO-2 XXL (8B) (Open) | 25.63 | 31.30 | 26.42 | 25.82 | 35.06 | 29.71 | 26.74 | 30.00 | 28.24 |
| Unified-IO-2 L (1B) (Open) | 27.31 | 22.90 | 26.42 | 27.78 | 29.87 | 29.14 | 27.67 | 27.09 | 27.40 |

Omni-Modal Language Models (Visual Only)

| Methods | AV Align | Comp. | Ctx. Und. | Evt. Seq. | Infer. | Reas. | 30s | 60s | Avg | Δ vs. AV |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-Omni-30B-A3B-Instruct (Open) | 47.90 | 67.18 | 56.48 | 53.92 | 70.78 | 61.14 | 57.96 | 57.64 | 57.81 | -14.0 |
| Qwen3-Omni-30B-A3B-Thinking (Open) | 44.96 | 64.89 | 55.96 | 59.48 | 67.53 | 60.00 | 56.26 | 59.45 | 57.73 | -15.9 |
| Gemini 2.0 Flash (Closed) | 39.08 | 64.12 | 56.48 | 56.21 | 67.53 | 62.29 | 56.57 | 55.45 | 56.06 | -11.8 |
| Gemini 2.0 Flash Lite (Closed) | 43.70 | 58.02 | 53.89 | 45.10 | 64.29 | 60.57 | 53.01 | 51.64 | 52.38 | -8.9 |
| Qwen2.5-Omni-7B-Instruct (Open) | 34.45 | 58.78 | 47.67 | 49.67 | 62.99 | 54.86 | 48.69 | 51.09 | 49.79 | -12.3 |
| Qwen2.5-Omni-3B-Instruct (Open) | 37.39 | 51.91 | 44.56 | 41.18 | 64.29 | 48.00 | 46.52 | 45.64 | 46.12 | -14.1 |
| Gemini 2.5 Flash (Closed) | 37.55 | 37.21 | 40.43 | 44.78 | 57.05 | 53.29 | 42.35 | 47.46 | 44.61 | -28.5 |
| Gemini 2.5 Flash Lite (Closed) | 36.97 | 45.80 | 37.31 | 39.54 | 59.74 | 47.43 | 44.67 | 41.27 | 43.11 | -18.8 |

Omni-Modal Language Models (Audio Only)

| Methods | AV Align | Comp. | Ctx. Und. | Evt. Seq. | Infer. | Reas. | 30s | 60s | Avg | Δ vs. AV |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-Omni-30B-A3B-Instruct (Open) | 54.20 | 69.47 | 51.81 | 51.63 | 74.03 | 78.86 | 63.37 | 58.18 | 60.99 | -10.9 |
| Qwen3-Omni-30B-A3B-Thinking (Open) | 54.62 | 67.94 | 49.22 | 51.31 | 77.27 | 77.71 | 65.22 | 55.27 | 60.65 | -13.0 |
| Gemini 2.5 Flash (Closed) | 46.64 | 55.73 | 44.56 | 42.48 | 70.78 | 78.86 | 55.64 | 52.18 | 54.05 | -19.0 |
| Gemini 2.5 Flash Lite (Closed) | 42.02 | 61.83 | 41.97 | 45.10 | 68.83 | 65.14 | 54.25 | 48.91 | 51.80 | -10.1 |

Visual Language Models (Visual Only)

| Methods | AV Align | Comp. | Ctx. Und. | Evt. Seq. | Infer. | Reas. | 30s | 60s | Avg |
|---|---|---|---|---|---|---|---|---|---|
| Qwen3-VL-30B-A3B-Instruct (Open) | 47.48 | 68.70 | 52.33 | 55.88 | 67.53 | 61.14 | 57.34 | 57.27 | 57.31 |
| Qwen3-VL-8B-Instruct (Open) | 44.54 | 63.36 | 50.78 | 59.80 | 69.48 | 58.86 | 56.41 | 57.27 | 56.81 |
| GPT-4o (Closed) | 47.90 | 62.60 | 52.33 | 52.61 | 66.23 | 66.29 | 55.64 | 57.45 | 56.47 |
| Qwen3-VL-4B-Instruct (Open) | 43.70 | 61.07 | 54.40 | 53.27 | 68.18 | 58.86 | 54.40 | 56.00 | 55.14 |
| Qwen2.5-VL-7B-Instruct (Open) | 36.97 | 46.56 | 33.68 | 37.91 | 51.95 | 44.00 | 39.26 | 42.36 | 40.68 |
| Qwen2.5-VL-3B-Instruct (Open) | 35.71 | 43.51 | 34.72 | 33.66 | 43.51 | 39.43 | 37.71 | 37.09 | 37.43 |

Audio Language Models (Audio Only)

| Methods | AV Align | Comp. | Ctx. Und. | Evt. Seq. | Infer. | Reas. | 30s | 60s | Avg |
|---|---|---|---|---|---|---|---|---|---|
| Audio Flamingo 3 (7B) (Open) | 40.76 | 55.73 | 43.01 | 40.52 | 65.58 | 68.00 | 50.23 | 49.45 | 49.87 |
| Qwen2-Audio (7B) (Open) | 28.99 | 35.88 | 27.46 | 32.03 | 33.77 | 33.14 | 31.22 | 31.82 | 31.50 |

Textual Language Models (Without Visual and Audio)

| Methods | AV Align | Comp. | Ctx. Und. | Evt. Seq. | Infer. | Reas. | 30s | 60s | Avg |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4o (Closed) | 33.19 | 43.51 | 28.50 | 30.39 | 44.81 | 46.86 | 36.48 | 36.18 | 36.34 |
| Deepseek-V3 (671B) (Closed) | 31.93 | 41.22 | 29.02 | 29.41 | 44.81 | 46.29 | 35.24 | 36.00 | 35.59 |
| Qwen2.5-14B-Instruct (Open) | 30.25 | 39.69 | 27.98 | 28.43 | 42.21 | 42.86 | 32.15 | 35.82 | 33.83 |

Citation


@misc{zhou2025dailyomni,
      title={Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities}, 
      author={Ziwei Zhou and Rui Wang and Zuxuan Wu},
      year={2025},
      eprint={2505.17862},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2505.17862}, 
}