DiaDem: Advancing Dialogue Descriptions in Audiovisual Video Captioning for Multimodal Large Language Models

Xinlong Chen1,2,3*, Weihong Lin3, Jingyun Hua3, Linli Yao4,
Yue Ding1,2, Bozhou Li4, Bohan Zeng4, Yang Shi4,
Qiang Liu1,2†, Yuanxing Zhang3, Pengfei Wan3, Liang Wang1,2, Tieniu Tan1,2,5
1New Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
2School of Artificial Intelligence, University of Chinese Academy of Sciences
3Kling Team, Kuaishou Technology    4Peking University    5Nanjing University

*This work was conducted during the author's internship at Kling Team, Kuaishou Technology

†Corresponding author: qiang.liu@nlpr.ia.ac.cn

Abstract

Accurate dialogue description is a critical yet underexplored aspect of audiovisual video captioning, with profound implications for downstream multimodal understanding and generation tasks. Despite rapid progress in multimodal large language models (MLLMs), existing approaches often struggle to faithfully capture who says what in complex audiovisual scenes. To address this limitation, we propose DiaDem, a powerful audiovisual video captioning model that generates captions with more precise dialogue descriptions while maintaining strong overall captioning performance across general audiovisual content.
To enable systematic evaluation of dialogue description capabilities, we further introduce DiaDemBench, a comprehensive benchmark designed to evaluate models across diverse dialogue scenarios, emphasizing both speaker attribution accuracy and utterance transcription fidelity in audiovisual captions. Extensive experiments on DiaDemBench reveal that even commercial models still exhibit substantial room for improvement in dialogue-aware captioning. Notably, DiaDem not only outperforms the Gemini series in dialogue description accuracy but also achieves competitive performance on general audiovisual captioning benchmarks, demonstrating its overall effectiveness.

Full caption:

A medium close-up shot from behind a glass terrarium shows a woman with wavy, shoulder-length blonde hair. She looks down into the enclosure with a gentle, slightly smiling expression. Inside the terrarium, which appears to be a repurposed aquarium, there is a piece of driftwood, some small green plants, and two strange, spider-like objects made of what looks like matchsticks or sticks. The background is a brightly lit room with a window covered by blinds. The woman speaks in a gentle, slightly hopeful tone, "Yeah. I thought maybe a change of scenery would cheer him up." The camera cuts to a medium shot of a man standing in a kitchen. He has short brown hair and is dressed in a blue t-shirt under an unbuttoned blue and white plaid shirt. He looks off-screen with a serious, slightly skeptical expression. The kitchen features white cabinets, a granite countertop, and a stainless steel refrigerator adorned with magnets. The man replies with dry sarcasm, "Maybe he would enjoy Bangkok." The view returns to the woman and a younger girl with blonde hair peeking over the top of the glass terrarium. The woman maintains a pleasant expression, while the younger girl looks on with a curious and slightly wide-eyed look. The younger girl interjects with a tone of curious surprise, "Is that Ramona's ball creature?" The final shot is a medium shot of the man in the kitchen. He now looks directly at the camera with a deadpan, unimpressed expression, his hands resting near his waist. The kitchen setting remains the same, with the refrigerator and cabinets visible behind him. The younger girl adds with a flat, unimpressed tone, "Cool."

Data Annotation Pipeline for Training DiaDem

SFT Data Curation

Figure 1: Leveraging the complementary strengths of different models, we design a dedicated pipeline to construct a high-quality audiovisual caption corpus featuring precise dialogue descriptions for SFT, equipping the model with foundational dialogue description skills while maintaining its general captioning performance.

Key Features of DiaDemBench

First-of-its-Kind Benchmark

The first dedicated benchmark to evaluate the accuracy of dialogue descriptions in audiovisual video captioning, focusing on both correct speaker attribution and precise utterance transcription.

Robust Evaluation Protocol

A principled evaluation framework consisting of ASR (utterance transcription accuracy) and REF (speaker reference accuracy) scores, incorporating a novel adaptive merging strategy for dialogue tuple matching and an MLLM-based judge for verifying speaker consistency.
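As a rough illustration of how the two scores relate, the sketch below matches predicted (speaker, utterance) tuples to reference tuples and then checks speaker consistency on the matched pairs. It is a simplified stand-in, not the DiaDemBench protocol: the names `DialogueTuple`, `word_error_rate`, `score_dialogue`, and the `max_wer` threshold are hypothetical, the greedy matching replaces the adaptive merging strategy (which is not reproduced here), and the `same_speaker` callback stands in for the MLLM-based judge.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class DialogueTuple:
    speaker: str    # free-text speaker reference, e.g. "the woman"
    utterance: str  # transcribed speech


def _words(text: str) -> List[str]:
    # Lowercase and replace punctuation with spaces before splitting.
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()


def word_error_rate(ref: str, hyp: str) -> float:
    # Word-level edit distance divided by the reference length.
    r, h = _words(ref), _words(hyp)
    if not r:
        return 0.0 if not h else 1.0
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, start=1):
        cur = [i]
        for j, hw in enumerate(h, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (rw != hw)))
        prev = cur
    return prev[-1] / len(r)


def score_dialogue(
    refs: List[DialogueTuple],
    preds: List[DialogueTuple],
    same_speaker: Callable[[str, str], bool],
    max_wer: float = 0.5,  # hypothetical matching threshold
) -> Tuple[float, float]:
    """Return (asr_like, ref_like) scores as fractions of reference tuples."""
    used = set()
    asr_hits = ref_hits = 0
    for ref in refs:
        best_j: Optional[int] = None
        best_wer: Optional[float] = None
        for j, pred in enumerate(preds):
            if j in used:
                continue
            wer = word_error_rate(ref.utterance, pred.utterance)
            if wer <= max_wer and (best_wer is None or wer < best_wer):
                best_j, best_wer = j, wer
        if best_j is None:
            continue  # utterance missing from the caption
        used.add(best_j)
        asr_hits += 1
        # Assumption: in practice an MLLM judge would decide this.
        if same_speaker(ref.speaker, preds[best_j].speaker):
            ref_hits += 1
    n = len(refs)
    return (asr_hits / n, ref_hits / n) if n else (0.0, 0.0)
```

In this sketch a caption that transcribes an utterance correctly but attributes it to the wrong person still counts toward the transcription score and not toward the speaker score, which is the distinction the two metrics are meant to capture.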

High-Quality Annotation

A hybrid annotation pipeline in which initial dialogue descriptions are generated using Gemini-2.5-Pro, followed by meticulous manual refinement to ensure accurate utterance transcriptions and reliable speaker attribution.

Comprehensive and Diverse Scenarios

A collection of 1,039 videos covering a wide spectrum of dialogue-centric scenarios, with broad category coverage and balanced distributions, enabling robust and generalizable evaluation.

Data Statistics of DiaDemBench

SFT Data Statistics

Figure 2: DiaDemBench features relatively balanced distributions of speaker count, on-screen people count, video duration, and language, while carefully modulating the difficulty of speaker attribution and utterance transcription.

Challenging Categories of DiaDemBench

Figure 3: We showcase four representative dialogue scenarios from DiaDemBench in which existing state-of-the-art audiovisual video captioning models still struggle to produce accurate dialogue descriptions, with the aim of providing insights for future advancements in audiovisual captioning.

Evaluation Results on DiaDemBench

Table 1: Model performance on DiaDemBench. \(N\) denotes the speaker count. "Overlap" refers to subsets with temporally overlapping speech and is mutually exclusive with the groups defined by speaker count \(N\). *For ARC-Qwen-Video-Narrator, speaker and utterance information appears only in the thinking phase rather than in the final answer, so we use the thinking content as the model's output for evaluation.
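The mutually exclusive grouping described in the caption can be sketched as follows. This is only an illustration under assumptions: the function names and the per-sample fields (speaker count, overlap flag, score) are hypothetical, and the actual table may bin speaker counts differently.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def subset_label(speaker_count: int, has_overlap: bool) -> str:
    # Overlap takes precedence, so an overlapping-speech video is never
    # also counted in a speaker-count group.
    return "Overlap" if has_overlap else f"N={speaker_count}"


def mean_by_subset(samples: Iterable[Tuple[int, bool, float]]) -> Dict[str, float]:
    # samples: (speaker_count, has_overlap, score) per video.
    totals = defaultdict(lambda: [0.0, 0])
    for speaker_count, has_overlap, score in samples:
        acc = totals[subset_label(speaker_count, has_overlap)]
        acc[0] += score
        acc[1] += 1
    return {label: total / count for label, (total, count) in totals.items()}
```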

Evaluation of DiaDem on General Audiovisual Captioning Benchmarks

Evaluation Results on General Captioning

Table 2: Model performance on general audiovisual video captioning benchmarks. Following AVoCaDO, we replace the judge model for the video-SALMONN-2 test set with GPT-4.1 to ensure more reliable evaluation.

Ablation Studies

Additional Cases

BibTeX

@misc{chen2026diademadvancingdialoguedescriptions,
  title={DiaDem: Advancing Dialogue Descriptions in Audiovisual Video Captioning for Multimodal Large Language Models},
  author={Xinlong Chen and Weihong Lin and Jingyun Hua and Linli Yao and Yue Ding and Bozhou Li and Bohan Zeng and Yang Shi and Qiang Liu and Yuanxing Zhang and Pengfei Wan and Liang Wang and Tieniu Tan},
  year={2026},
  eprint={2601.19267},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2601.19267},
}