LLM-CapGen: A Lightweight Framework for Video Caption Generation Using Large Language Models

Biswarup Yogi

¹ Computational Sciences, Brainware University, Kolkata 700125, India

Academic Editor: Marjan Mernik

Published: 04 June 2026 by MDPI in The 2nd International Online Conference on Mathematics and Applications, session: Mathematics, Computer Science and Artificial Intelligence

Abstract:

LLM-CapGen is a lightweight framework for video caption generation using large language models. The framework focuses on effective multimodal fusion while maintaining low computational cost. It extracts spatio-temporal features from video frames to capture visual events and motion patterns. It also extracts audio features, including speech and environmental sounds. In addition, the framework constructs a semantic scene graph that represents objects, actions, and temporal relations. This graph helps the model understand what happens, when it happens, and how events are connected.
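One way to picture a semantic scene graph of this kind is as a list of timed (subject, action, object) events from which temporal relations are derived. The sketch below is only an illustration under that assumption: the `Event` and `SceneGraph` names, the time-span fields, and the two relation labels ("before", "overlaps") are hypothetical and are not taken from the abstract.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    """One (subject, action, object) observation with a time span in seconds."""
    subject: str
    action: str
    obj: str
    start: float
    end: float


@dataclass
class SceneGraph:
    """Hypothetical container: events are nodes, temporal relations are edges."""
    events: list = field(default_factory=list)

    def add(self, event):
        self.events.append(event)

    def temporal_relations(self):
        """Return (earlier, relation, later) triples for every pair of events.

        Events are sorted by start time; a pair is labeled "before" when the
        first event ends no later than the second starts, else "overlaps".
        """
        ordered = sorted(self.events, key=lambda e: (e.start, e.end))
        relations = []
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                relation = "before" if a.end <= b.start else "overlaps"
                relations.append((a, relation, b))
        return relations


graph = SceneGraph()
graph.add(Event("person", "opens", "fridge", 0.0, 2.0))
graph.add(Event("person", "picks up", "milk", 2.0, 4.0))
graph.add(Event("person", "pours", "milk", 3.5, 6.0))
for a, relation, b in graph.temporal_relations():
    print(f"{a.action} {a.obj} --{relation}--> {b.action} {b.obj}")
```

In this toy representation, "what happens" is the event triple, "when" is its time span, and "how events are connected" is the derived relation between pairs of events.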

The framework fuses visual, audio, and semantic features into structured prompts and feeds them to a large language model. This process enables the model to generate context-aware, temporally consistent captions. A post-processing step further improves caption quality by reducing redundancy and ensuring grammatical correctness across video segments.
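The abstract does not give the prompt template or the post-processing rule, so the following is a minimal sketch under assumed choices: per-segment features are serialized as labeled text lines, and redundancy is reduced by dropping a caption whose word overlap (Jaccard similarity) with the previously kept caption exceeds a threshold. The function names, dictionary keys, and the 0.8 threshold are all hypothetical, and the grammatical-correction step is not covered.

```python
def build_prompt(segments):
    """Serialize per-segment features into one structured text prompt.

    Each segment is a dict with hypothetical keys: start, end, visual,
    audio, and events (a list of short strings).
    """
    lines = ["Describe the video segment by segment, in temporal order."]
    for seg in segments:
        lines.append(f"[{seg['start']:.1f}s-{seg['end']:.1f}s]")
        lines.append(f"  visual: {seg['visual']}")
        lines.append(f"  audio: {seg['audio']}")
        lines.append(f"  events: {'; '.join(seg['events'])}")
    return "\n".join(lines)


def jaccard(a, b):
    """Word-level Jaccard similarity between two captions (case-insensitive)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def drop_redundant(captions, threshold=0.8):
    """Keep a caption only if it differs enough from the last kept caption."""
    kept = []
    for cap in captions:
        if kept and jaccard(kept[-1], cap) >= threshold:
            continue
        kept.append(cap)
    return kept


segments = [
    {"start": 0, "end": 2, "visual": "kitchen, open fridge",
     "audio": "door creak", "events": ["person opens fridge"]},
]
print(build_prompt(segments))
print(drop_redundant([
    "A person opens the fridge.",
    "A person opens the fridge.",
    "The person pours milk.",
]))
```

Comparing each caption only with the previously kept one keeps the filter linear in the number of segments, at the cost of missing repeats that are separated by a different caption.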

We evaluate LLM-CapGen on the EgoSchema benchmark for long-form egocentric video understanding. The framework scores 34.2 BLEU-4, 25.8 METEOR, 49.7 ROUGE-L, and 102.3 CIDEr. It outperforms several strong baselines, including VideoBERT, CLIPCap, BLIP-2, and GPT-4V, in zero-shot settings. Ablation studies show that audio encoding, semantic graph reasoning, and temporal prompting each improve caption quality.
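For readers unfamiliar with the metrics, ROUGE-L rewards the longest common subsequence (LCS) of words shared by a generated caption and a reference. The sketch below is a simplified sentence-level version with whitespace tokenization and a single reference; the abstract does not name the evaluation toolkit, and the `beta` default of 1.0 is an assumption (standard toolkits may use a different weighting and tokenizer). It returns a value in [0, 1], whereas scores such as those above are conventionally reported multiplied by 100.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """Sentence-level ROUGE-L F-score from LCS-based precision and recall."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


score = rouge_l("a person opens the fridge", "a person opens a fridge")
print(f"ROUGE-L: {100 * score:.1f}")
```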

These results show that lightweight multimodal fusion with structured prompting can produce accurate and meaningful video captions. LLM-CapGen provides an efficient and adaptable solution for multimodal video understanding tasks.

Keywords: Video Captioning, Large Language Models (LLMs), Multimodal Learning, Semantic Graph Reasoning, Natural Language Generation
