Winning the iterated prisoner's dilemma with neural networks and the wisdom of Sun Tzu (1)

In this blog post (and some others following later) I will discuss my experiments with the "iterated prisoner's dilemma". The "prisoner's dilemma" is a game theory thought experiment: two rational agents (rational in the sense that they always try to make optimal decisions) can cooperate for mutual benefit or betray their partner ("defect") for individual reward. As the problem is very famous and many descriptions can be found on the web (like the Wikipedia article), I will not describe it here again. Please read one of the many articles about it before proceeding.

The iterated prisoner's dilemma is important because it models many real-world situations and is fundamental for understanding human cooperation and trust. Because these topics interest me very much, I wrote a Python library to run some experiments of my own. There are already several great libraries for conducting experiments with the iterated prisoner's dilemma (like Axelrod). My reasons/goals for developing my own version:

  • I wanted to fully understand it, and for that, development from scratch is always helpful.
  • Keep it as small and simple as possible. The library should allow easy modification of any aspect of the code.
  • Object-oriented design of the agent code, making it easy to create variants of agents built on base agents (see the sketch after this list).
  • Include agents based on neural networks
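
To make the object-oriented design concrete, here is a minimal sketch of what a base agent class and a derived "Tit for tat" agent could look like. The class and method names are only illustrative; the actual code (discussed in the next post) differs in its details.

```python
class Agent:
    """Illustrative base class: concrete agents only override choose_action()."""

    COOPERATE, DEFECT = "C", "D"

    def __init__(self):
        self.my_history = []      # own past moves
        self.their_history = []   # adversary's past moves

    def choose_action(self):
        raise NotImplementedError

    def observe(self, my_move, their_move):
        # called after every round so the agent can update its internal state
        self.my_history.append(my_move)
        self.their_history.append(their_move)


class TitForTatAgent(Agent):
    """Cooperate in the first round, then copy the adversary's previous move."""

    def choose_action(self):
        if not self.their_history:
            return self.COOPERATE
        return self.their_history[-1]
```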

The motivation for including agents based on neural networks came from the writings of Sun Tzu (c. 544–496 BC), a Chinese military general, strategist and philosopher who wrote the influential book "The Art of War".

A starting point was the following sentence:

"If you know the enemy and know yourself, you need not fear the result of a hundred battles." — Sun Tzu (The Art of War)

I was therefore curious whether it was possible to build agents that could learn (from many adversaries) to quickly understand the current adversary ("know the enemy") and improve their strategy accordingly.

My Python code will be discussed in a later blog post (I will also upload it to my GitHub account then). It includes many standard agents (like always cooperating, always defecting, "Tit for tat" and Grim Trigger), but I was mainly interested in building learning agents based on neural networks.

The first version (in the Python code: class RNNPredictorAgent and its subclasses) tries to predict the next move of the adversary and uses a fixed strategy to act on this prediction: if the predicted probability of cooperation is larger than a certain threshold (like 0.5), the agent cooperates, otherwise it defects. Relatively simple recurrent neural networks (plain RNNs, and with less success LSTMs) are used for learning.
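
The decision rule on top of the prediction is deliberately simple. Here is a minimal sketch of that rule, building on the base class sketched above; the frequency-based predictor is only a stand-in for the recurrent network, and all names are illustrative rather than the actual API of my library.

```python
class FrequencyPredictor:
    """Stand-in for the RNN: estimates P(cooperate) as the observed frequency.
    The real agent uses a recurrent network trained on the move history."""

    def predict(self, their_history):
        if not their_history:
            return 0.5  # no information yet
        return their_history.count(Agent.COOPERATE) / len(their_history)


class ThresholdPredictorAgent(Agent):
    """Predict the adversary's next move, then apply a fixed threshold rule."""

    def __init__(self, predictor=None, threshold=0.5):
        super().__init__()
        self.predictor = predictor or FrequencyPredictor()
        self.threshold = threshold

    def choose_action(self):
        # p_coop: predicted probability that the adversary cooperates next round
        p_coop = self.predictor.predict(self.their_history)
        return self.COOPERATE if p_coop > self.threshold else self.DEFECT
```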

My experiments showed that such an agent offers no significant advantage over the "Tit for tat" strategy (which is known to perform very well in many settings). It does show decent performance, which at least proves that prediction with a recurrent neural network works to some extent. One reason why its performance disappoints is that such a strategy is unable to exploit a "too nice" agent (e.g. an agent which always cooperates).
Learning about the behavior of adversaries therefore makes little sense when it is combined with such a fixed strategy. A simple fixed strategy without any learning, like "Tit for tat" (which is equally unable to exploit "too nice" agents!), will do as well or better.

So what then is our large brain good for? Could it be that it only pays off (at least with respect to social behavior) because the capability to exploit "too nice" adversaries is important, while at the same time one must be able to cooperate with those agents which punish uncooperative behavior?

"To know your enemy, you must become your enemy." — Sun Tzu (The Art of War)

And this is highly interesting, as it hints at something deep: that good (here in the sense of cooperative behavior) and evil (in the sense of defecting behavior) might be coupled. Even Sun Tzu seems to have sensed this:

"It is easy to love your friend, but sometimes the hardest lesson to learn is to love your enemy."

I therefore built a more advanced agent (class LookAheadRNNPredictorAgent) which performs a tree search over the possible moves (its own and those of the adversary), calculating the most likely outcomes and the reward or damage resulting from them. Again a neural network is used to predict the move of the adversary at each node of the tree. The code for this is far more complex than that of the first, fixed-strategy agent mentioned above. In some sense this agent has no strategy at all: it just tries to predict the outcomes of the different action options and chooses the action with the (probably) better outcome according to the logic of the game. It somehow is the adversary (as required by Sun Tzu).
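
The core of this look-ahead can be sketched as a small recursive search. The payoff values and function names below are assumptions for illustration (the common payoffs T=5, R=3, P=1, S=0), not the exact code of LookAheadRNNPredictorAgent; in the real agent, `predict` is the recurrent network applied to the move history at each node.

```python
# Prisoner's dilemma payoffs indexed by (my_move, their_move);
# the common values T=5, R=3, P=1, S=0 are assumed here.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0,
          ("D", "C"): 5, ("D", "D"): 1}

def lookahead(their_history, my_history, predict, depth):
    """Return (expected payoff, best own move) for a search `depth` rounds deep.

    `predict(their_history, my_history)` is a stand-in for the recurrent
    network; it returns the predicted probability that the adversary
    cooperates in the next round, given the histories at this node.
    """
    if depth == 0:
        return 0.0, None
    p_coop = predict(their_history, my_history)
    best_value, best_move = float("-inf"), None
    for my_move in ("C", "D"):
        value = 0.0
        # average over both possible adversary moves, weighted by the prediction
        for their_move, prob in (("C", p_coop), ("D", 1.0 - p_coop)):
            immediate = PAYOFF[(my_move, their_move)]
            future, _ = lookahead(their_history + [their_move],
                                  my_history + [my_move],
                                  predict, depth - 1)
            value += prob * (immediate + future)
        if value > best_value:
            best_value, best_move = value, my_move
    return best_value, best_move
```

With a search depth of 1 such an agent always defects (the single-round dominant move); only a deeper search, together with a predictor that has learned that defection provokes retaliation, can make cooperation the better choice.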

But now things become much more complicated:

  • What the agent learns from other agents depends on how it behaves. But because its behavior is completely undefined in the beginning (when the agent does not yet know anything about the behavior of other agents), the learning process easily becomes unstable.
  • Therefore, during an initial learning phase the agent must employ a (completely or partially) fixed strategy. This strategy determines what the agent will experience and therefore also how it will later predict the actions of other agents.
  • This initial strategy must be diverse, because the agent has to learn to play well against different kinds of adversary agents; this is called exploration. So it cannot be too simple (like always playing "Tit for tat") and must change over time and/or stochastically from game to game (a sketch of such an exploration policy follows this list).
  • The experience of the agent (and therefore what it learns) also depends heavily on the population of other agents (e.g. whether they are rather cooperative or defect rather quickly).
  • Because of the tree search, this agent needs much more processing power than the other agents, so training is much more difficult and time-consuming.
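
One simple way to get such a diverse exploration phase is to draw a fresh base strategy at random for every training game and to flip its moves occasionally as noise. This is only an illustrative warm-up schedule, not the exact one used in my code:

```python
import random

def make_exploration_policy(noise=0.1):
    """Illustrative warm-up policy: one base strategy per training game plus noise.

    Drawing a different base strategy for every game (and flipping moves with a
    small probability) gives the network diverse experience of many reactions.
    """
    base = random.choice([
        lambda their_hist: "C",                                    # always cooperate
        lambda their_hist: "D",                                    # always defect
        lambda their_hist: their_hist[-1] if their_hist else "C",  # tit for tat
        lambda their_hist: random.choice(("C", "D")),              # random play
    ])

    def policy(their_hist):
        move = base(their_hist)
        if random.random() < noise:
            move = "D" if move == "C" else "C"   # occasional random flip
        return move

    return policy


# one fresh exploration policy per training game
policy = make_exploration_policy()
first_move = policy([])
```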

So this agent gave me lots of headaches.

In the next blog post I will present a brief description of the Python code used for my experiments.


Follow me on Twitter to get informed about new content on this blog.

I don't like paywalled content, so I have made the content of my blog freely available for everyone. But I would still love to invest much more time into this blog, which means that I need some income from writing. So if you would like to read articles from me more often and can afford $2 a month, please consider supporting me via Patreon. Every contribution motivates me!