Mahjong, an imperfect information game with a state space exceeding $10^{48}$, poses significant challenges for reinforcement learning (RL) due to its sparse rewards and delayed feedback. Traditional RL approaches relying solely on terminal win/loss signals often suffer from inefficient convergence and suboptimal strategy development. This study proposes a hybrid reward mechanism that integrates prior knowledge and predictive modeling to address these limitations.
The framework combines three components to provide dense, interpretable feedback. First, an instructor model based on a CNN-LSTM architecture is trained on expert gameplay data to generate auxiliary rewards by evaluating how closely agent actions align with expert-predicted optimal moves. Data augmentation techniques, such as tile-suit permutation, are employed to improve generalization from the limited training samples. Second, a dual-layer GRU-based global reward predictor captures long-term dependencies across a game, forecasting winning probabilities and score dynamics to supply intermediate reward signals. Third, the hybrid reward system integrates a base reward structure (+3 for wins, -1 for losses, and 0 for draws) scaled by hand-composition multipliers. Dynamic bonuses are then applied according to the consistency between predicted and actual outcomes, while step-wise predictive rewards quantify each decision's impact as the difference between successive value estimates.
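To make the reward composition concrete, the following is a minimal Python sketch of how such a hybrid signal could be assembled. The +3/-1/0 base rewards, the hand-composition scaling, the outcome-consistency bonus, and the successive-value-difference term follow the description above; the function name, weight values, and the example inputs are illustrative assumptions rather than the paper's exact implementation.

```python
# Minimal sketch of the hybrid reward composition described above.
# The +3/-1/0 base rewards, consistency bonus, and step-wise value
# difference follow the text; all weights and names are assumptions.

BASE_REWARD = {"win": 3.0, "loss": -1.0, "draw": 0.0}

def hybrid_reward(outcome, hand_multiplier, predicted_win_prob, won,
                  value_prev, value_next,
                  bonus_weight=0.5, step_weight=0.1):
    """Combine base, dynamic-bonus, and step-wise predictive rewards."""
    # Terminal base reward scaled by the hand-composition multiplier.
    base = BASE_REWARD[outcome] * hand_multiplier

    # Dynamic bonus: larger when the global reward predictor's forecast
    # was consistent with the actual outcome of the hand.
    consistency = predicted_win_prob if won else (1.0 - predicted_win_prob)
    bonus = bonus_weight * consistency

    # Step-wise predictive reward: difference between successive value
    # estimates from the GRU-based global reward predictor.
    step = step_weight * (value_next - value_prev)

    return base + bonus + step


# Example: a win with a 2x hand multiplier, a 0.8 predicted win probability,
# and a value estimate rising from 0.55 to 0.70 over the final decision.
r = hybrid_reward("win", 2.0, 0.8, won=True, value_prev=0.55, value_next=0.70)
print(round(r, 3))  # 3*2 + 0.5*0.8 + 0.1*0.15 = 6.415
```

In practice, the step-wise term would be emitted at every decision rather than only at the terminal state; it is folded into one function here purely to show how the three signals combine.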
Experimental evaluations on 10,000 2v2 Shangrao Mahjong games demonstrated the superior performance of the proposed method. Compared with supervised learning, Proximal Policy Optimization RL (PPO RL), multi-agent RL, and deep RL baselines, the hybrid mechanism achieved the highest average score of 5.68, surpassing the PPO RL baseline's 5.24. The model reached a win rate of 26.56% with accelerated convergence, attaining a 30% win rate within 15 days of training versus the 35 days required by deep RL. Strategic efficacy was further validated by a maximum average win score of 15.49. By synthesizing knowledge-driven guidance with model-based predictive feedback, the approach effectively mitigates reward sparsity while improving training stability and decision-making depth.