🤿 DENSE VIDEO UNDERSTANDING WITH GATED RESIDUAL TOKENIZATION
Dense Information Video Evaluation (DIVE) Benchmark

The first benchmark dedicated to QA-driven high-frame-rate comprehension, where answer-relevant information appears in nearly every frame.
Status: Benchmark test split released • Model will be released after acceptance.

arXiv · Hugging Face Dataset · GitHub · Project Site
Dense Video QA · High-FPS VLM · Gated Tokenization
🎬 Teaser figure · Dense frames, high-FPS, QA-centric

High-Frame-Rate Understanding, Without the Token Tax

📚 DIVE Benchmark · First dataset tailored for dense information across frames: educational videos, sign language, procedures, sports breakdowns.
⚙️ GRT (Gated Residual Tokenization) · Skip static regions during tokenization + merge redundant tokens within scenes → scalable high-FPS reasoning.
🧠 VLM-Ready · Sub-linear token/time growth under dense sampling; compatible with LLaVA-OV-style pipelines.
Dense QA · Answer in nearly every frame
High-FPS · Fine-grained temporal cues
Token-Smart · Skip & Merge by GRT
Authors
1 Northeastern University  |  2 Princeton University  |  3 University of Maryland, College Park
What is DIVE?

DIVE (Dense Information Video Evaluation) targets scenarios where content is dense across frames (e.g., educational videos, surgical procedures, sign language). Conventional VLLMs rely on low-FPS sampling and keyframes, dropping critical temporal details needed for frame-by-frame reasoning.

See the paper for motivation and task definition. [PDF]

GRT in a Nutshell
  • Motion-Compensated Gated Inter-Tokenization: motion masks skip static regions during tokenization → sub-linear growth in token count/time.
  • Semantic-Scene Intra-Tokenization Merging: merge redundant tokens within a scene while preserving dynamic semantics.

Together, Gated Residual Tokenization (GRT) enables scalable high-FPS understanding on DIVE. See arXiv HTML.
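
To make the gating idea concrete, here is a minimal NumPy sketch of motion-gated patch selection. It is an illustration only, not the official GRT implementation: it gates on a plain frame-difference threshold, whereas GRT uses motion-compensated gating and residual tokenization. The function name gated_patch_indices, the 16-pixel patch size, and the 0.05 threshold are illustrative placeholders.

import numpy as np

def gated_patch_indices(prev_frame, cur_frame, patch=16, thresh=0.05):
    """Return (row, col) indices of patches whose mean absolute change exceeds thresh."""
    diff = np.abs(cur_frame.astype(np.float32) - prev_frame.astype(np.float32)) / 255.0
    H, W = diff.shape[:2]
    gh, gw = H // patch, W // patch
    # Average the per-pixel residual within each patch (over spatial dims and channels).
    blocks = diff[: gh * patch, : gw * patch].reshape(gh, patch, gw, patch, -1)
    motion = blocks.mean(axis=(1, 3, 4))  # (gh, gw) motion score per patch
    # Only "moving" patches are re-tokenized; static patches can reuse cached tokens.
    return [tuple(ij) for ij in np.argwhere(motion > thresh)]

# Toy example: two 224x224 RGB frames that differ only in one 32x32 corner region.
prev = np.zeros((224, 224, 3), dtype=np.uint8)
cur = prev.copy()
cur[32:64, 32:64] = 255  # simulated motion
print(len(gated_patch_indices(prev, cur)), "of", (224 // 16) ** 2, "patches re-tokenized")  # 4 of 196

Under such gating, a mostly-static video re-tokenizes only a small fraction of patches per additional sampled frame, which is the intuition behind the sub-linear token growth reported for GRT.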

Dataset

Released: DIVE test split on 🤗 Hugging Face.

haichaozhang/DenseVideoEvaluation

from datasets import load_dataset
# Load the DIVE test split from the Hugging Face Hub
ds = load_dataset("haichaozhang/DenseVideoEvaluation", split="test")
print(ds[0])  # inspect the first QA example
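
The column layout of the released split is defined on the dataset card; the snippet below (standard 🤗 Datasets calls, nothing DIVE-specific) is a quick way to check it before writing an evaluation loop.

from datasets import load_dataset
ds = load_dataset("haichaozhang/DenseVideoEvaluation", split="test")
print(ds.features)  # column names and types as defined by the released split
print(ds.num_rows)  # number of QA examples in the test split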
Evaluate via LMMS-EVAL

We are preparing a PR to integrate DIVE into LMMS-EVAL.

# Install lmms-eval from source
git clone https://github.com/EvolvingLMMs-Lab/lmms-eval.git
cd lmms-eval
pip install -e .
# Example run with a standard task (the DIVE task name will follow once the PR is merged)
accelerate launch \
  --num_processes=1 \
  -m lmms_eval \
  --model llava_onevision \
  --model_args "pretrained=lmms-lab/llava-onevision-qwen2-0.5b-ov,conv_template=qwen_1_5,model_name=llava_qwen" \
  --tasks mme \
  --batch_size 1 \
  --log_samples \
  --output_path ./logs/ \
  --verbosity=DEBUG
# Dense-video variant (placeholder)
accelerate launch \
  --num_processes=1 \
  -m lmms_eval \
  --model llava_ov_dense_video \
  --model_args "pretrained=lmms-lab/llava-onevision-qwen2-0.5b-ov,conv_template=qwen_1_5,model_name=llava_qwen,use_gated_tok=True,use_vision_merge=False,profiling=False,dense_frame_fps=0.001" \
  --tasks mvbench \
  --batch_size 1 \
  --log_samples \
  --output_path ./logs/ \
  --verbosity=DEBUG
Timeline
Date         Status     Description
2025/09/18   Released   DIVE benchmark (test split)
TBD          Planned    Merge DIVE into LMMS-EVAL (PR in prep)
TBD          Planned    Release multi-FPS dataset variants
TBD          Planned    Add more dense-video task categories
TBD          Planned    Release full GRT model + training/inference code
Citation
@article{zhang2025dive,
  title={Dense Video Understanding with Gated Residual Tokenization},
  author={Haichao Zhang and Wenhao Chai and Shwai He and Ang Li and Yun Fu},
  journal={arXiv preprint arXiv:2509.14199},
  year={2025}
}