Reinforcement learning (RL) has emerged as a promising paradigm for autonomous control tasks, where agents must learn sequential decision-making in dynamic and uncertain environments. Among RL algorithms, Deep Q-Networks (DQN) and Advantage Actor-Critic (A2C) are two widely used approaches, the former value-based and the latter an actor-critic policy-gradient method, each with distinct strengths and limitations. This study presents a comparative analysis of DQN and A2C applied to the challenging CarRacing-v3 environment, whose native action space is continuous (steering, acceleration, and braking).
Both agents were trained using preprocessed input frames, which involved grayscale conversion, cropping, resizing, normalization, and frame stacking to capture temporal dependencies. DQN was implemented with experience replay, target networks, and Double DQN extensions, while A2C employed a shared convolutional encoder for actor and critic networks with entropy regularization to encourage exploration. Training progress was measured using average return, stability, and computational efficiency.
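The preprocessing steps listed above can be sketched as follows. This is a minimal illustration rather than the study's actual pipeline: the 84x84 output size, the 12-row bottom crop, the stack depth of 4, the nearest-neighbour resize, and the names `preprocess` and `FrameStack` are all assumptions made for the example.

```python
from collections import deque

import numpy as np

# Standard luma weights for RGB -> grayscale conversion.
LUMA = np.array([0.299, 0.587, 0.114], dtype=np.float32)


def preprocess(frame, out_size=84, crop_bottom=12):
    """Convert an RGB uint8 frame (H, W, 3) to a float32 (out_size, out_size) image in [0, 1].

    crop_bottom and out_size are assumed values, not taken from the study.
    """
    gray = frame.astype(np.float32) @ LUMA            # grayscale, shape (H, W)
    gray = gray[: gray.shape[0] - crop_bottom, :]     # crop the bottom rows
    # Nearest-neighbour resize by sampling row and column indices.
    rows = np.arange(out_size) * gray.shape[0] // out_size
    cols = np.arange(out_size) * gray.shape[1] // out_size
    resized = gray[rows][:, cols]
    return resized / 255.0                            # normalize to [0, 1]


class FrameStack:
    """Keep the k most recent preprocessed frames as one (k, H, W) observation."""

    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, frame):
        # Fill the buffer with copies of the first frame of an episode.
        processed = preprocess(frame)
        for _ in range(self.k):
            self.frames.append(processed)
        return np.stack(self.frames)

    def step(self, frame):
        # Append the newest frame; the oldest one is dropped automatically.
        self.frames.append(preprocess(frame))
        return np.stack(self.frames)
```

Stacking consecutive frames lets a feed-forward convolutional network infer velocity and heading from a single observation, which is the temporal information a lone frame lacks.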
Results revealed that neither DQN nor A2C achieved consistently stable driving policies. DQN struggled with continuous control due to its discretized action formulation, resulting in fluctuating average returns and poor convergence (final average return ≈ -71.5). A2C, while better suited for continuous actions, also stagnated with limited learning progression (average return ≈ -72.4), suggesting inefficient exploration and sensitivity to hyperparameters.
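To make the discretization issue concrete, the sketch below maps a small set of discrete action indices onto CarRacing's continuous action vector (steering in [-1, 1], gas in [0, 1], brake in [0, 1]). The five-entry action table, the brake strength of 0.8, and the `select_action` helper are hypothetical; the action set actually used in the study is not stated here.

```python
import numpy as np

# Hypothetical discrete action table: each row is [steering, gas, brake].
DISCRETE_ACTIONS = np.array(
    [
        [0.0, 0.0, 0.0],   # coast
        [-1.0, 0.0, 0.0],  # full left
        [1.0, 0.0, 0.0],   # full right
        [0.0, 1.0, 0.0],   # full gas
        [0.0, 0.0, 0.8],   # brake (assumed strength)
    ],
    dtype=np.float32,
)


def select_action(q_values, epsilon, rng):
    """Epsilon-greedy choice over the discrete table.

    Returns the chosen index and the continuous action vector it maps to.
    """
    if rng.random() < epsilon:
        idx = int(rng.integers(len(DISCRETE_ACTIONS)))  # explore
    else:
        idx = int(np.argmax(q_values))                  # exploit
    return idx, DISCRETE_ACTIONS[idx]
```

With a table like this, the agent can only issue full-lock steering or full throttle, so fine corrections must be approximated by alternating between extreme actions, which is one plausible source of the instability described above.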
In conclusion, this study highlights the challenges of applying baseline deep RL algorithms to high-dimensional, pixel-based autonomous driving tasks. The findings provide empirical insights into their trade-offs and point toward the need for methods designed for continuous control, such as DDPG, TD3, or SAC, that combine stability, efficiency, and adaptability for real-world autonomous driving applications.
