Autonomous lunar probe landing presents a complex control challenge due to limited sensor feedback, delayed communications, and dynamic terrain conditions. In this work, we present a lightweight and optimized reinforcement learning solution using a Deep Q-Network (DQN) agent trained on the LunarLander-v3 environment from the Gymnasium library. Our aim is to develop a model capable of precise, resource-efficient landings under constrained simulation settings.
The agent interacts with an 8-dimensional state space and a discrete action space of four actions, learning through experience replay and an ε-greedy policy. We systematically evaluated the impact of neural network architecture (Tiny, Base, Wide, Deep) and conducted extensive hyperparameter tuning via grid search over learning rates, discount factors, and soft update rates. The best-performing configuration (the 128-128 Wide architecture with a learning rate of 0.0005, a discount factor of 0.99, and a soft update rate of 0.01) achieved an average reward of 262.89 after 355.98 seconds of training.
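As a concrete illustration of this setup, the sketch below shows a 128-128 "Wide" Q-network for the 8-dimensional LunarLander-v3 state and four-action space, together with the best-found hyperparameters reported above. It assumes a PyTorch implementation; the class names, the Adam optimizer, and the soft-update helper are illustrative choices, not taken from our code.

```python
# Minimal sketch of the "Wide" (128-128) Q-network and the best-found
# hyperparameters; names and structure are illustrative assumptions.
import gymnasium as gym
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    """Maps the 8-dimensional LunarLander state to Q-values for the 4 actions."""

    def __init__(self, state_dim: int = 8, action_dim: int = 4, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)


# Best-performing configuration from the grid search (values from this work).
LEARNING_RATE = 5e-4   # Adam step size
GAMMA = 0.99           # discount factor
TAU = 0.01             # soft-update rate for the target network

env = gym.make("LunarLander-v3")
policy_net = QNetwork(env.observation_space.shape[0], env.action_space.n)
target_net = QNetwork(env.observation_space.shape[0], env.action_space.n)
target_net.load_state_dict(policy_net.state_dict())
optimizer = torch.optim.Adam(policy_net.parameters(), lr=LEARNING_RATE)


def soft_update(policy: nn.Module, target: nn.Module, tau: float = TAU) -> None:
    """Polyak-average the policy weights into the target network."""
    with torch.no_grad():
        for p, t in zip(policy.parameters(), target.parameters()):
            t.mul_(1.0 - tau).add_(tau * p)
```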
Final testing revealed that training to a reward threshold of 250 yields a 93% landing success rate, outperforming both under-trained and over-trained agents in efficiency and generalization. This result was validated over 100 test episodes, confirming consistent, high-accuracy autonomous landing behavior.
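The evaluation protocol can be sketched as a greedy rollout over 100 test episodes, as below. The success criterion used here (an episode return of at least 200, the environment's conventional "solved" score) is an assumption for illustration and may differ from the exact landing criterion applied in our experiments.

```python
# Hedged sketch of a 100-episode evaluation; the success threshold of 200
# is an assumed stand-in for the paper's landing-success criterion.
import gymnasium as gym
import torch


def evaluate(policy_net, n_episodes: int = 100, success_return: float = 200.0) -> float:
    env = gym.make("LunarLander-v3")
    successes = 0
    for _ in range(n_episodes):
        state, _ = env.reset()
        done, episode_return = False, 0.0
        while not done:
            with torch.no_grad():
                q_values = policy_net(torch.as_tensor(state, dtype=torch.float32))
            action = int(torch.argmax(q_values))  # greedy action at test time
            state, reward, terminated, truncated, _ = env.step(action)
            episode_return += reward
            done = terminated or truncated
        successes += episode_return >= success_return
    env.close()
    return successes / n_episodes  # e.g. 0.93 for 93 successful landings out of 100
```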
Our findings highlight the viability of deploying lightweight, well-tuned DQN agents for real-time lunar landing scenarios. The proposed approach serves as a scalable blueprint for future space robotics systems, bridging the gap between simulation and real-world feasibility. Future work will incorporate terrain complexity and uncertainty modeling to extend robustness in dynamic planetary environments.
