Mimicking Human Behavior in the Stochastic Prisoner’s Dilemma
1  Economics Department, Yale University, New Haven, CT, USA
Academic Editor: Konstantinos Serfes

Abstract:

Introduction: This paper studies human behavioral patterns exhibited in the Iterated Stochastic Prisoner's Dilemma and trains algorithms that replicate these patterns. We create a training framework that enables learning algorithms to capture the biases and inconsistencies observed in human play. We find that a Reinforcement Learning algorithm (Q-Learner) can emulate how human agents update their beliefs about the future, akin to pessimism and optimism, once the final form of the game is realized.

Methods: We adopt the stochastic setup introduced by Kloosterman (2020), in which human agents play an infinitely repeated game while facing uncertainty regarding the payoff matrix that is used to calculate the reward in each iteration of the dilemma. The defective probability, that is, the probability that the payoff matrix offering lower returns to cooperation is realized, is allowed to vary. To model learning behavior, we implement a randomized, teacherless training framework applied to a Q-Learning algorithm. This approach does not require pre-existing data; rather, the Q-Learner learns to play via cues from its environment. It is trained against over 230 Prisoner's Dilemma strategies.
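A minimal Python sketch of this kind of training loop is given below, under illustrative assumptions: each round, nature draws the payoff matrix according to the defective probability, and the learner updates its Q-values from the realized payoff alone (teacherless learning). The payoff entries, hyperparameters (alpha, gamma, epsilon), and the small opponent pool are hypothetical placeholders, not the paper's exact specification, which trains against a pool of over 230 strategies.

```python
import random
from itertools import product

COOPERATE, DEFECT = 0, 1
ACTIONS = (COOPERATE, DEFECT)

# Two candidate payoff matrices; entries are (own payoff, opponent payoff).
# The "defective" matrix offers a lower return to mutual cooperation.
# These numbers are illustrative, not the paper's payoffs.
PAYOFFS_COOP = {(0, 0): (3, 3), (0, 1): (0, 4), (1, 0): (4, 0), (1, 1): (1, 1)}
PAYOFFS_DEF  = {(0, 0): (2, 2), (0, 1): (0, 4), (1, 0): (4, 0), (1, 1): (1, 1)}


def new_q_table():
    # State = last (own action, opponent action); all Q-values start at zero.
    return {s: {a: 0.0 for a in ACTIONS} for s in product(ACTIONS, ACTIONS)}


def train_episode(q, opponent, defective_prob, alpha=0.8, gamma=0.95,
                  epsilon=0.1, rounds=200):
    """One iterated game against a fixed opponent strategy. Learning is
    teacherless: the only feedback is the realized payoff each round."""
    state = (COOPERATE, COOPERATE)
    for _ in range(rounds):
        # Nature draws which payoff matrix is realized this round.
        payoffs = PAYOFFS_DEF if random.random() < defective_prob else PAYOFFS_COOP

        # Epsilon-greedy action selection from the current Q-table.
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q[state][a])

        opp_action = opponent(state)
        reward = payoffs[(action, opp_action)][0]

        # Standard Q-learning update; a high learning rate (alpha) is the
        # regime the conclusion associates with human-like belief updating.
        next_state = (action, opp_action)
        q[state][action] += alpha * (reward + gamma * max(q[next_state].values())
                                     - q[state][action])
        state = next_state
    return q


# Tiny illustrative opponent pool standing in for the 230+ strategies.
tit_for_tat   = lambda s: s[0]                           # copy learner's last move
grim_proxy    = lambda s: DEFECT if DEFECT in s else COOPERATE  # one-memory Grim Trigger proxy
always_defect = lambda s: DEFECT

q = new_q_table()
for opponent in (tit_for_tat, grim_proxy, always_defect):
    train_episode(q, opponent, defective_prob=0.3)
```

As a usage note, sweeping `defective_prob` over a grid and recording whether the trained learner's first-round play is a cooperative or defective majority reproduces the kind of comparison reported in the Results.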

Results: We find that at the beginning of the game, the trained Q-Learner exhibits a cooperative majority that persists across most values of the defective probability. Flips from a cooperative to a defective majority tend to persist at higher values of the defective probability, although cooperative majorities remain the prevailing outcome. The range of defective probabilities at which the Q-Learner flips to a defective majority aligns with the Grim Trigger probability at which humans exhibit the same switch in behavior in Kloosterman's experiments.

Conclusion: The value of the defective probability leads players to make optimistic or pessimistic assumptions about the future, which in turn affects their current behavior. Under high learning rates, the Q-Learner can exhibit the same behavior after randomized training.

Keywords: algorithmic game theory; Q-learner; prisoner's dilemma; collusion