DIVE (Dense Information Video Evaluation) targets scenarios where information is dense across frames (e.g., educational videos, surgical procedures, sign language). Conventional video LLMs (VLLMs) rely on low-FPS sampling and keyframe selection, which drops the temporal detail needed for frame-by-frame reasoning.
See the paper for motivation and task definition. [PDF]
- Motion-Compensated Gated Inter-Tokenization: motion masks skip static regions during tokenization, giving sub-linear growth in token count and time.
- Semantic-Scene Intra-Tokenization Merging: merge redundant tokens within a scene while preserving dynamic semantics.
Together, Gated Residual Tokenization (GRT) enables scalable high-FPS understanding on DIVE. See arXiv HTML.
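As a rough illustration of the two ideas above, the following sketch gates patches with a simple frame-difference mask and then merges near-duplicate tokens by cosine similarity. It is not the released GRT implementation: plain frame differencing stands in for the paper's motion compensation, and the function names, patch representation, and thresholds are all assumptions chosen for illustration.

```python
import math


def gated_patch_indices(prev_frame, cur_frame, motion_threshold=0.1):
    """Return indices of patches whose mean absolute change exceeds the threshold.

    Each frame is a list of patches; each patch is a flat list of pixel values.
    Frame differencing is an assumed stand-in for real motion estimation.
    """
    moving = []
    for idx, (prev_patch, cur_patch) in enumerate(zip(prev_frame, cur_frame)):
        diff = sum(abs(a - b) for a, b in zip(prev_patch, cur_patch)) / len(cur_patch)
        if diff > motion_threshold:
            moving.append(idx)  # only these patches would be tokenized
    return moving


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def merge_similar_tokens(tokens, sim_threshold=0.95):
    """Greedily fold each token into the first running group it closely matches."""
    groups = []  # each entry is [sum_vector, member_count]
    for tok in tokens:
        for entry in groups:
            mean = [s / entry[1] for s in entry[0]]
            if cosine(mean, tok) >= sim_threshold:
                entry[0] = [s + t for s, t in zip(entry[0], tok)]
                entry[1] += 1
                break
        else:
            groups.append([list(tok), 1])
    # One averaged token per group
    return [[s / count for s in total] for total, count in groups]


# Toy data: only the middle patch changes between the two frames.
prev = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
cur = [[0.0, 0.0], [0.9, 0.1], [1.0, 1.0]]
moving = gated_patch_indices(prev, cur)

# Toy tokens: the first two are nearly parallel and collapse into one.
merged = merge_similar_tokens([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]])
print(moving, len(merged))
```

In this toy setting, static patches never reach the tokenizer and redundant tokens are averaged, which is the intuition behind keeping token counts manageable at high frame rates.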
Released: the DIVE test split is available on 🤗 Hugging Face as haichaozhang/DenseVideoEvaluation.
from datasets import load_dataset
ds = load_dataset("haichaozhang/DenseVideoEvaluation", split="test")
print(ds[0])
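Printing a raw sample can flood the terminal if a record carries video bytes or long text. The helper below is a small sketch for inspecting a record's fields without dumping them in full. It makes no assumption about the DIVE schema: `summarize_record` is a hypothetical helper name, and the field names in the demo record are made up for illustration.

```python
def summarize_record(record, max_chars=60):
    """Map each field name to a short 'type: preview' string."""
    summary = {}
    for key, value in record.items():
        if isinstance(value, (bytes, bytearray)):
            preview = f"{len(value)} bytes"
        elif isinstance(value, (list, tuple)):
            preview = f"{len(value)} items"
        else:
            text = repr(value)
            preview = text if len(text) <= max_chars else text[:max_chars] + "..."
        summary[key] = f"{type(value).__name__}: {preview}"
    return summary


# Made-up record for illustration; real DIVE fields may differ.
example = {
    "video": b"\x00" * 2048,
    "question": "What is written on the board?",
    "frames": [0, 1, 2],
}
for name, info in summarize_record(example).items():
    print(f"{name} -> {info}")
```

With the dataset loaded as above, `summarize_record(ds[0])` gives the same kind of overview for a real sample.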
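We are preparing a PR to integrate DIVE into LMMS-EVAL. Until it is merged, the commands below install LMMS-EVAL from source and run a standard example evaluation; the dense-video command that follows is a placeholder.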
git clone https://github.com/EvolvingLMMs-Lab/lmms-eval.git
cd lmms-eval
pip install -e .
accelerate launch \
--num_processes=1 \
-m lmms_eval \
--model llava_onevision \
--model_args "pretrained=lmms-lab/llava-onevision-qwen2-0.5b-ov,conv_template=qwen_1_5,model_name=llava_qwen" \
--tasks mme \
--batch_size 1 \
--log_samples \
--output_path ./logs/ \
--verbosity=DEBUG
# Dense-video variant (placeholder)
accelerate launch \
--num_processes=1 \
-m lmms_eval \
--model llava_ov_dense_video \
--model_args "pretrained=lmms-lab/llava-onevision-qwen2-0.5b-ov,conv_template=qwen_1_5,model_name=llava_qwen,use_gated_tok=True,use_vision_merge=False,profiling=False,dense_frame_fps=0.001" \
--tasks mvbench \
--batch_size 1 \
--log_samples \
--output_path ./logs/ \
--verbosity=DEBUG
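The `--model_args` value is a single comma-separated string of `key=value` pairs, which gets hard to read as flags accumulate. A minimal sketch for assembling it from a dictionary is shown below. `build_model_args` is a hypothetical helper, not part of LMMS-EVAL, and the keys simply mirror the placeholder command above.

```python
def build_model_args(options):
    """Join options into the 'key=value,key=value' form passed to --model_args."""
    parts = []
    for key, value in options.items():
        text = str(value)
        # Commas and '=' would make the joined string ambiguous, so reject them.
        if "," in key or "=" in key or "," in text:
            raise ValueError(f"unsupported character in {key!r}={text!r}")
        parts.append(f"{key}={text}")
    return ",".join(parts)


dense_args = build_model_args({
    "pretrained": "lmms-lab/llava-onevision-qwen2-0.5b-ov",
    "conv_template": "qwen_1_5",
    "model_name": "llava_qwen",
    "use_gated_tok": True,
    "use_vision_merge": False,
    "profiling": False,
    "dense_frame_fps": 0.001,
})
print(dense_args)
```

Keeping the flags in a dictionary makes it easier to toggle options such as `use_gated_tok` or `use_vision_merge` between runs before pasting the resulting string into the command.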
| Date | Status | Description |
|---|---|---|
| 2025/09/18 | ✅ | Release DIVE benchmark (test split) |
| TBD | ⭕ | Merge DIVE into LMMS-EVAL (PR in prep) |
| TBD | ⭕ | Release multi-FPS dataset variants |
| TBD | ⭕ | Add more dense-video task categories |
| TBD | ⭕ | Release full GRT model + training/inference code |
@article{zhang2025dive,
title={Dense Video Understanding with Gated Residual Tokenization},
author={Haichao Zhang and Wenhao Chai and Shwai He and Ang Li and Yun Fu},
journal={arXiv preprint arXiv:2509.14199},
year={2025}
}