Introduction
Effective candidate generation is critical for two-stage recommender systems; however, traditional lexical methods such as TF-IDF often fail to capture deeper semantic context. This limitation leads to suboptimal recall, particularly for new or niche items (the "cold start" problem), which degrades both recommendation quality and user experience. This study addresses the need for a more semantically aware approach to the initial recall phase.
Methods
We propose a methodology that integrates a Large Language Model (LLM) into a distributed Apache Spark pipeline for large-scale content enrichment. The pipeline generates a vector embedding and a concise, context-aware summary for each content item in the feed. The enriched records are then indexed into Elasticsearch to support efficient, semantically aware vector-based retrieval during the candidate generation phase.
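The retrieval step can be illustrated with a minimal in-memory sketch. It is not the pipeline itself: `toy_embed` is a hypothetical hashing-based stand-in for the LLM embedding call, a plain dictionary plays the role of the Elasticsearch vector index, and every name here (`DIM`, `toy_embed`, `build_index`, `top_k`) is an assumption made for illustration only.

```python
import hashlib
import math
from typing import Dict, List, Tuple

DIM = 256  # assumed vector size for the toy stand-in, not a value from the study


def toy_embed(text: str, dim: int = DIM) -> List[float]:
    """Hypothetical placeholder for an LLM embedding call.

    Hashes each whitespace token into one of `dim` buckets and
    L2-normalizes the result, so a dot product equals cosine similarity.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def build_index(items: Dict[str, str]) -> Dict[str, List[float]]:
    """Embed every item once; stands in for indexing into a vector store."""
    return {item_id: toy_embed(text) for item_id, text in items.items()}


def top_k(index: Dict[str, List[float]], query: str, k: int = 10) -> List[Tuple[str, float]]:
    """Return the k items whose vectors are most similar to the query vector."""
    q = toy_embed(query)
    scored = [
        (item_id, sum(a * b for a, b in zip(q, vec)))
        for item_id, vec in index.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    catalog = {
        "a": "distributed spark pipeline for content enrichment",
        "b": "chocolate cake recipe with dark cocoa",
    }
    index = build_index(catalog)
    print(top_k(index, "spark pipeline", k=1))
```

With a real embedding model, only `toy_embed` would change; the ranking logic (embed the query, score against stored vectors, keep the top k) is the part this sketch is meant to show.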
Results
Our quantitative analysis compared the LLM-enriched method against a traditional TF-IDF baseline using the Recall@10 metric. The proposed method achieved a Recall@10 of 62% against the baseline's 45%, a gain of 17 percentage points and a 37% relative improvement. This indicates a substantial increase in the relevance of generated candidates. Furthermore, the resulting candidate pool showed a marked improvement in semantic diversity, covering niche user interests more fully and improving the quality of items passed to the ranking stage.
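For reference, the metric and the relative-improvement arithmetic can be sketched as follows. The abstract does not state which Recall@K variant was used; this sketch assumes the common definition (relevant items found in the top K, divided by the total number of relevant items), and the function names are illustrative.

```python
from typing import Iterable, Sequence


def recall_at_k(recommended: Sequence[str], relevant: Iterable[str], k: int = 10) -> float:
    """Fraction of the relevant items that appear among the top-k recommendations."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0  # assumed convention when a user has no relevant items
    hits = len(set(recommended[:k]) & relevant_set)
    return hits / len(relevant_set)


def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, as a fraction of the baseline."""
    return (new - baseline) / baseline


if __name__ == "__main__":
    # Reported (rounded) Recall@10 values from the Results section.
    gain = relative_improvement(0.62, 0.45)
    print(f"relative improvement: {gain:.1%}")
```

Applied to the rounded figures, (0.62 - 0.45) / 0.45 is approximately 0.378, which agrees with the reported 37% up to rounding of the underlying values.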
Conclusions
Leveraging LLMs for semantic enrichment in a distributed environment is an effective way to strengthen the recall stage of recommender systems. The method supplies richer, more contextually aware input to downstream ranking models and mitigates the cold-start problem, paving the way for more accurate and personalized content discovery.
Enhancing Candidate Generation in Recommendation Systems through LLM-Powered Semantic Enrichment in a Distributed Environment
Published: 03 December 2025 by MDPI in The 6th International Electronic Conference on Applied Sciences, session Computing and Artificial Intelligence
Keywords: Recommender Systems; Large Language Models; Candidate Generation; Semantic Enrichment; Apache Spark; Vector Embeddings; Cold Start Problem
