LLM-CapGen is a lightweight framework for video caption generation with large language models that targets effective multimodal fusion at low computational cost. It extracts spatio-temporal features from video frames to capture visual events and motion patterns, and audio features covering speech and environmental sounds. It also constructs a semantic scene graph of objects, actions, and temporal relations, which helps the model understand what happens, when it happens, and how events are connected.
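To make the scene graph idea concrete, the following is a minimal sketch of one possible representation, not the framework's actual data structure. The names `Event`, `SceneGraph`, and `temporal_relations`, as well as the two relation labels "before" and "overlaps", are assumptions introduced for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Event:
    """One (subject, action, object) triple with a time span in seconds.

    Hypothetical structure; the source does not specify the graph schema.
    """
    subject: str
    action: str
    obj: str
    start: float
    end: float


@dataclass
class SceneGraph:
    """Holds events and derives pairwise temporal relations between them."""
    events: List[Event] = field(default_factory=list)

    def add(self, event: Event) -> None:
        self.events.append(event)

    def temporal_relations(self) -> List[Tuple[Event, str, Event]]:
        # Sort by start time so every pair (a, b) has a starting no later than b.
        ordered = sorted(self.events, key=lambda e: (e.start, e.end))
        relations = []
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                # "before" if a finishes by the time b starts, else "overlaps".
                label = "before" if a.end <= b.start else "overlaps"
                relations.append((a, label, b))
        return relations


graph = SceneGraph()
graph.add(Event("person", "picks up", "cup", 0.0, 2.0))
graph.add(Event("person", "pours", "water", 2.0, 5.0))
graph.add(Event("kettle", "whistles", "", 4.0, 6.0))

for a, label, b in graph.temporal_relations():
    print(f"{a.subject} {a.action} {a.obj}".strip(), label,
          f"{b.subject} {b.action} {b.obj}".strip())
```

In this sketch the "what" is the triple, the "when" is the time span, and the "how events are connected" is the derived relation between pairs of events. A richer relation set (for example, containment or causality) would be a natural extension but is not implied by the source.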
The framework fuses visual, audio, and semantic features into structured prompts and feeds them to a large language model. This process enables the model to generate context-aware, temporally consistent captions. A post-processing step further improves caption quality by reducing redundancy and ensuring grammatical correctness across video segments.
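The fusion and post-processing steps above can be illustrated with a short sketch. It assumes that the visual, audio, and graph outputs have already been converted to short text tags and triples; the prompt layout, the function names `build_prompt` and `drop_repeated_captions`, and the exact-match redundancy rule are all hypothetical choices made for illustration, not the framework's actual prompt template or post-processor.

```python
from typing import List, Sequence, Tuple


def build_prompt(
    segment_id: int,
    visual_tags: Sequence[str],
    audio_tags: Sequence[str],
    graph_triples: Sequence[Tuple[str, str, str]],
) -> str:
    """Assemble one structured text prompt for a single video segment.

    The section headers (VISUAL, AUDIO, SCENE GRAPH, TASK) are an assumed layout.
    """
    lines = [
        f"[SEGMENT {segment_id}]",
        "VISUAL: " + ", ".join(visual_tags),
        "AUDIO: " + ", ".join(audio_tags),
        "SCENE GRAPH:",
    ]
    lines += [f"- {s} {r} {o}" for s, r, o in graph_triples]
    lines.append(
        "TASK: Describe this segment in one sentence, "
        "consistent with earlier segments."
    )
    return "\n".join(lines)


def drop_repeated_captions(captions: Sequence[str]) -> List[str]:
    """Remove a caption if it repeats the previous kept one.

    Comparison ignores case, extra whitespace, and a trailing period.
    This is a simple stand-in for the redundancy reduction step.
    """
    kept: List[Tuple[str, str]] = []
    for caption in captions:
        norm = " ".join(caption.lower().split()).rstrip(".")
        if kept and norm == kept[-1][0]:
            continue
        kept.append((norm, caption))
    return [original for _, original in kept]


prompt = build_prompt(
    0,
    visual_tags=["kitchen", "hand motion"],
    audio_tags=["running water"],
    graph_triples=[("person", "pours", "water")],
)
print(prompt)
print(drop_repeated_captions(
    ["A man pours water.", "a man  pours water", "He drinks."]
))
```

The resulting prompt string would be passed to whichever language model is in use; that call is omitted here because the source does not name a specific model interface.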
We evaluate LLM-CapGen on the EgoSchema benchmark for long-form egocentric video understanding, where it achieves 34.2 BLEU-4, 25.8 METEOR, 49.7 ROUGE-L, and 102.3 CIDEr. It outperforms several baselines, including VideoBERT, CLIPCap, BLIP-2, and GPT-4V, in zero-shot settings. Ablation studies show that audio encoding, semantic graph reasoning, and temporal prompting each improve caption quality.
These results show that lightweight multimodal fusion with structured prompting can produce accurate and meaningful video captions. LLM-CapGen provides an efficient and adaptable solution for multimodal video understanding tasks.