Enhancing Candidate Generation in Recommendation Systems through LLM-Powered Semantic Enrichment in a Distributed Environment

Balagangadhar Reddy Kandula; Lija Jacob

Department of Data Science, CHRIST (Deemed to be University), Lavasa, India

Academic Editor: Lucia Billeci

Published: 03 December 2025 by MDPI in The 6th International Electronic Conference on Applied Sciences, session Computing and Artificial Intelligence

Abstract:

Introduction
Effective candidate generation is critical for two-stage recommender systems; however, traditional lexical methods such as TF-IDF often fail to capture deeper semantic context. This limitation leads to suboptimal recall, particularly for new or niche items (the "cold start" problem), which degrades both recommendation quality and user experience. This study addresses the need for a more semantically aware approach to the initial recall phase.
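To make the limitation concrete, the following is a minimal sketch (not the baseline used in the study) of TF-IDF matching on three short texts. The example texts, the whitespace tokenizer, and the function names are illustrative assumptions; the IDF uses the smoothed variant found in scikit-learn. Two texts with the same intent but no shared words receive a similarity of exactly zero, while a text with a different destination but overlapping words scores well above zero.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (raw TF times smoothed IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {
                term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
                for term, count in tf.items()
            }
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Illustrative texts (assumed, not taken from the study's data).
docs = [
    "cheap flights to paris",           # reference text
    "budget airfare for france trips",  # same intent, no shared words
    "cheap flights to rome",            # different destination, shared words
]
vecs = tfidf_vectors(docs)
sim_paraphrase = cosine(vecs[0], vecs[1])  # 0.0: no term overlap at all
sim_overlap = cosine(vecs[0], vecs[2])     # positive: three shared terms
```

A dense embedding model would be expected to place the first two texts close together, which is the gap the proposed enrichment targets.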
Methods
We propose a novel methodology that integrates a Large Language Model (LLM) into a distributed Apache Spark pipeline for large-scale content enrichment. The pipeline generates high-quality vector embeddings and concise, context-aware summaries for each content item in the feed. The enriched items are then indexed in Elasticsearch to support efficient, semantically aware vector-based retrieval during the candidate generation phase.
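The abstract does not give implementation details, so the following is only a sketch of how such a pipeline could be wired together. The field names (`item_id`, `summary`, `embedding`), the toy `toy_summarize` and `toy_embed` stand-ins for the LLM calls, the choice to embed the summary rather than the raw text, and the vector dimension are all assumptions. The enrichment step is written as a generator over an iterator of rows, which is the shape a per-partition map such as PySpark's `mapPartitions` expects. The mapping and query bodies follow Elasticsearch 8.x conventions for `dense_vector` fields and the top-level `knn` search option.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical field names, chosen for illustration only.
SUMMARY_FIELD = "summary"
EMBEDDING_FIELD = "embedding"


def enrich_partition(
    rows: Iterable[Dict[str, str]],
    summarize: Callable[[str], str],
    embed: Callable[[str], List[float]],
) -> Iterator[Dict[str, object]]:
    """Enrich one partition of content items with a summary and an embedding."""
    for row in rows:
        summary = summarize(row["text"])
        yield {
            "item_id": row["item_id"],
            SUMMARY_FIELD: summary,
            # Assumption: the embedding is computed from the summary.
            EMBEDDING_FIELD: embed(summary),
        }


def index_mapping(dims: int) -> Dict[str, object]:
    """Index mapping with an indexed dense_vector field for kNN search."""
    return {
        "mappings": {
            "properties": {
                "item_id": {"type": "keyword"},
                SUMMARY_FIELD: {"type": "text"},
                EMBEDDING_FIELD: {
                    "type": "dense_vector",
                    "dims": dims,
                    "index": True,
                    "similarity": "cosine",
                },
            }
        }
    }


def knn_query(query_vector: List[float], k: int = 10, num_candidates: int = 100) -> Dict[str, object]:
    """Search body that retrieves the k nearest items as candidates."""
    return {
        "knn": {
            "field": EMBEDDING_FIELD,
            "query_vector": query_vector,
            "k": k,
            "num_candidates": num_candidates,
        },
        "_source": ["item_id", SUMMARY_FIELD],
    }


def toy_summarize(text: str) -> str:
    """Stand-in for an LLM summary call: keeps the first eight words."""
    return " ".join(text.split()[:8])


def toy_embed(text: str) -> List[float]:
    """Stand-in for an embedding model: deterministic, NOT semantic."""
    vec = [0.0] * 4
    for i, ch in enumerate(text):
        vec[i % 4] += ord(ch) / 1000.0
    return vec


partition = [{"item_id": "a1", "text": "A long article about trail running shoes for beginners"}]
enriched = list(enrich_partition(partition, toy_summarize, toy_embed))
mapping = index_mapping(dims=4)
query = knn_query(toy_embed("running shoes"), k=10)
```

In a real deployment the two stand-ins would be replaced by calls to the chosen LLM and embedding model, and `dims` would have to match that model's output size.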
Results
We compared the LLM-enriched method against a traditional TF-IDF baseline using the Recall@10 metric. The proposed method achieved a Recall@10 of 62% against the baseline's 45%, a 37% relative improvement. This demonstrates a substantial increase in the relevance of generated candidates. Furthermore, the resulting candidate pool showed a marked improvement in semantic diversity, better covering niche user interests and improving the quality of items passed to the ranking stage.
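The abstract does not define Recall@10, so the sketch below assumes the common definition: the fraction of a user's relevant items that appear among the top 10 retrieved candidates. The item identifiers are made up for illustration. The last lines reproduce the relative-improvement arithmetic from the reported figures; with the rounded values of 62% and 45% the ratio comes to roughly 37.8%, so the reported 37% presumably reflects unrounded underlying values or truncation.

```python
def recall_at_k(retrieved, relevant, k=10):
    """Fraction of relevant items found in the top-k retrieved list."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, as a fraction of the baseline."""
    return (new - baseline) / baseline


# Illustrative ranking (assumed data): "i11" sits at rank 11, outside the top 10,
# and "i42" was not retrieved at all, so 2 of 4 relevant items are recalled.
retrieved = ["i3", "i7", "i1", "i9", "i4", "i2", "i8", "i6", "i5", "i0", "i11"]
relevant = {"i1", "i4", "i11", "i42"}
example_recall = recall_at_k(retrieved, relevant, k=10)  # 0.5

# Reported figures: 62% for the proposed method, 45% for the TF-IDF baseline.
gain = relative_improvement(0.62, 0.45)  # about 0.378
```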
Conclusions
Leveraging LLMs for semantic enrichment in a distributed environment is an effective way to enhance the recall stage of recommender systems. The method supplies richer, more contextually aware input to downstream ranking models and mitigates the cold-start problem, paving the way for more accurate and personalized content discovery.

Keywords: Recommender Systems; Large Language Models; Candidate Generation; Semantic Enrichment; Apache Spark; Vector Embeddings; Cold Start Problem
