🤿 DENSE VIDEO UNDERSTANDING WITH GATED RESIDUAL TOKENIZATION
Dense Information Video Evaluation (DIVE) Benchmark

The first benchmark dedicated to QA-driven high-frame-rate comprehension, where answer-relevant information appears in nearly every frame.
Status: Benchmark test split released • Model will be released after acceptance.

arXiv · Hugging Face Dataset · GitHub · Project Site
Dense Video QA · High-FPS VLM · Gated Tokenization
🎬 Teaser figure · Dense frames, high-FPS, QA-centric

High-Frame-Rate Understanding, Without the Token Tax

📚 DIVE Benchmark · First dataset tailored for dense information across frames: educational videos, sign language, procedures, sports breakdowns.
⚙️ GRT (Gated Residual Tokenization) · Skip static regions during tokenization + merge redundant tokens within scenes → scalable high-FPS reasoning.
🧠 VLM-Ready · Sub-linear token/time growth under dense sampling; compatible with LLaVA-OV-style pipelines.
Dense QA · Answer in nearly every frame
High-FPS · Fine-grained temporal cues
Token-Smart · Skip & Merge by GRT
Authors
1 Northeastern University  |  2 Princeton University  |  3 University of Maryland, College Park
What is DIVE?

DIVE (Dense Information Video Evaluation) targets scenarios where content is dense across frames (e.g., educational videos, surgical procedures, sign language). Conventional VLLMs rely on low-FPS sampling and keyframes, dropping critical temporal details needed for frame-by-frame reasoning.

See the paper for motivation and task definition. [PDF]

GRT in a Nutshell
  • Motion-Compensated Gated Inter-Tokenization: motion masks skip static regions during tokenization → sub-linear growth in token count/time.
  • Semantic-Scene Intra-Tokenization Merging: merge redundant tokens within a scene while preserving dynamic semantics.

Together, Gated Residual Tokenization (GRT) enables scalable high-FPS understanding on DIVE. See arXiv HTML.
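
To make the gating idea concrete, here is a minimal NumPy sketch of motion-gated patch selection. It is an illustration only, not the official GRT implementation: it gates on a plain frame-difference threshold, whereas GRT uses motion-compensated gating and residual tokenization. The function name gated_patch_indices, the 16-pixel patch size, and the 0.05 threshold are illustrative placeholders.

import numpy as np

def gated_patch_indices(prev_frame, cur_frame, patch=16, thresh=0.05):
    """Return (row, col) indices of patches whose mean absolute change exceeds thresh."""
    diff = np.abs(cur_frame.astype(np.float32) - prev_frame.astype(np.float32)) / 255.0
    H, W = diff.shape[:2]
    gh, gw = H // patch, W // patch
    # Average the per-pixel residual within each patch (over spatial dims and channels).
    blocks = diff[: gh * patch, : gw * patch].reshape(gh, patch, gw, patch, -1)
    motion = blocks.mean(axis=(1, 3, 4))  # (gh, gw) motion score per patch
    # Only "moving" patches are re-tokenized; static patches can reuse cached tokens.
    return [tuple(ij) for ij in np.argwhere(motion > thresh)]

# Toy example: two 224x224 RGB frames that differ only in one 32x32 corner region.
prev = np.zeros((224, 224, 3), dtype=np.uint8)
cur = prev.copy()
cur[32:64, 32:64] = 255  # simulated motion
print(len(gated_patch_indices(prev, cur)), "of", (224 // 16) ** 2, "patches re-tokenized")  # 4 of 196

Under such gating, a mostly-static video re-tokenizes only a small fraction of patches per additional sampled frame, which is the intuition behind the sub-linear token growth reported for GRT.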

Dataset

Released: DIVE test split on 🤗 Hugging Face.

haichaozhang/DenseVideoEvaluation

from datasets import load_dataset
# Load the DIVE test split from the Hugging Face Hub
ds = load_dataset("haichaozhang/DenseVideoEvaluation", split="test")
print(ds[0])  # inspect the first QA example
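
The column layout of the released split is defined on the dataset card; the snippet below (standard 🤗 Datasets calls, nothing DIVE-specific) is a quick way to check it before writing an evaluation loop.

from datasets import load_dataset
ds = load_dataset("haichaozhang/DenseVideoEvaluation", split="test")
print(ds.features)  # column names and types as defined by the released split
print(ds.num_rows)  # number of QA examples in the test split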
Evaluate via LMMS-EVAL

We are preparing a PR to integrate DIVE into LMMS-EVAL.

# Install lmms-eval from source
git clone https://github.com/EvolvingLMMs-Lab/lmms-eval.git
cd lmms-eval
pip install -e .
# Example run with a standard task (the DIVE task name will follow once the PR is merged)
accelerate launch \
  --num_processes=1 \
  -m lmms_eval \
  --model llava_onevision \
  --model_args "pretrained=lmms-lab/llava-onevision-qwen2-0.5b-ov,conv_template=qwen_1_5,model_name=llava_qwen" \
  --tasks mme \
  --batch_size 1 \
  --log_samples \
  --output_path ./logs/ \
  --verbosity=DEBUG
# Dense-video variant (placeholder)
accelerate launch \
  --num_processes=1 \
  -m lmms_eval \
  --model llava_ov_dense_video \
  --model_args "pretrained=lmms-lab/llava-onevision-qwen2-0.5b-ov,conv_template=qwen_1_5,model_name=llava_qwen,use_gated_tok=True,use_vision_merge=False,profiling=False,dense_frame_fps=0.001" \
  --tasks mvbench \
  --batch_size 1 \
  --log_samples \
  --output_path ./logs/ \
  --verbosity=DEBUG
Timeline
Date         Status     Description
2025/09/18   Released   DIVE benchmark (test split)
TBD          Planned    Merge DIVE into LMMS-EVAL (PR in prep)
TBD          Planned    Release multi-FPS dataset variants
TBD          Planned    Add more dense-video task categories
TBD          Planned    Release full GRT model + training/inference code
Citation
@article{zhang2025dive,
  title={Dense Video Understanding with Gated Residual Tokenization},
  author={Haichao Zhang and Wenhao Chai and Shwai He and Ang Li and Yun Fu},
  journal={arXiv preprint arXiv:2509.14199},
  year={2025}
}