Winning the iterated prisoner's dilemma with neural networks and the wisdom of Sun Tzu (2)

Part 2

Python code walkthrough

This is part 2 of a series of blog posts. Please read part 1 first.


In this second blog post I present a brief description of the Python code of my implementation of the iterated prisoner's dilemma experiment.

All agent classes are derived from the Agent base class. A simplified version of it looks like this:

class Agent:

    no_of_instances = 0  # class-wide counter used to assign unique instance numbers

    def __init__(self) -> None:
        self.instance_no = Agent.no_of_instances
        Agent.no_of_instances += 1

    def reset(self):
        # called before every new encounter with another agent
        self.history = []  # list of (own_move, opponent_move) tuples
        self.payoff = 0    # payoff accumulated in the current encounter

    def get_decision(self):  # abstract method, overridden by the concrete agents
        return None

    def set_result(self, result, payoff):
        # result is the (own_move, opponent_move) tuple of the last step
        self.history.append(result)
        self.payoff += payoff

From this base class many agent variants with different behavior can be constructed. We will discuss some of them later.

Now a population of agents needs to be generated. The frequency of each agent class can be defined:

agent_distribution = [
    (DefectAgent,0.15),
    (CooperateAgent,0.15),
    (TitForTatAgent,0.15),
    (ForgivingTitForTatAgent,0.4),
    #(GrimTriggerAgent,0.1),
    (RandomAgent,0.05),
    #(RNNPredictorAgent,0.1), 
    #(LSTMPredictorAgent,0.075),
    (OptimisticRNNPredictorAgent,0.05),
    (LookAheadRNNPredictorAgent,0.05),
    #(SmartLearnLookAheadRNNPredictorAgent,0.1)
    #(ThresholdedOptimisticRNNPredictorAgent,0.05)
]

The creation probabilities (the second element of each tuple) must add up to 1. This is checked, and an error is raised if agent_distribution is not configured correctly.
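Such a check could look roughly like this (a minimal sketch; the exact message and error handling in the repository may differ):

total = sum(weight for _, weight in agent_distribution)
if abs(total - 1.0) > 1e-9:
    raise ValueError(f"agent_distribution weights sum to {total}, expected 1.0")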

Now NO_OF_AGENTS agents are generated according to this distribution (i.e. the agents list is populated):

for i in range(NO_OF_AGENTS):
    # pick the agent class according to the configured probabilities
    # (choices is random.choices, which supports weighted selection)
    ag = [ a[0] for a in agent_distribution ]
    weights = [ a[1] for a in agent_distribution ]
    selected_agent = choices(ag, weights)[0]

    # get the parameter list of the agent class and fill it
    # from the hyperparameters dictionary
    required_params = selected_agent.get_config_options()
    params = {}
    for p in required_params:
        params[p] = hyperparameters[p]

    # create the agent instance
    agents.append(selected_agent(**params))
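For illustration, get_config_options can be thought of as a method that simply lists the hyperparameter names an agent class needs, so that only those are looked up in the hyperparameters dictionary. The names below are made up and not necessarily the ones used in the repository:

class SomePredictorAgent(Agent):  # hypothetical agent class, for illustration only

    @staticmethod
    def get_config_options():
        # hyperparameter names this class expects in the hyperparameters dictionary
        return ["hidden_size", "learning_rate"]

    def __init__(self, hidden_size, learning_rate):
        super().__init__()
        self.hidden_size = hidden_size
        self.learning_rate = learning_rate

# matching entries in the (illustrative) hyperparameters dictionary
hyperparameters = {
    "hidden_size": 8,
    "learning_rate": 0.01,
}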

Now, for a number of training episodes, the following steps are repeated:

1 - Two different (!) agents are chosen randomly:

while True:
    agent1 = choice(agents)   # choice is random.choice
    agent2 = choice(agents)
    if agent1 != agent2:      # make sure two different instances were drawn
        break

2 - Before each encounter with another agent, the reset methods of both agent instances must be called; they reset the history list to [] and the payoff variable to 0:

agent1.reset()
agent2.reset()

3 - Then we let the two agents play a (fixed) number of steps against each other. Note that this number of steps is always unknown to the agents:

for j in range(steps_per_episode):
    # get the decisions from both agents
    cooperating1 = agent1.get_decision()  # some parameters omitted
    cooperating2 = agent2.get_decision()

    # communicate the results back to the two agents
    if cooperating1 and cooperating2:
        agent1.set_result((True, True),-2)
        agent2.set_result((True, True),-2)
    elif (not cooperating1) and (not cooperating2):
        agent1.set_result((False, False),-5)
        agent2.set_result((False, False),-5)
    elif cooperating1 and (not cooperating2):
        agent1.set_result((True, False),-10)
        agent2.set_result((False, True),0)
    else: # (not cooperating1) and cooperating2
        agent1.set_result((False, True),0)
        agent2.set_result((True, False),-10)
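As an aside, the same branching can be written more compactly as a lookup table. This is not how it is done in the repository, but it shows the payoff matrix at a glance:

# payoff matrix: (my_move, opponent_move) -> my payoff
# True = cooperate, False = defect
PAYOFFS = {
    (True, True): -2,    # mutual cooperation
    (False, False): -5,  # mutual defection
    (True, False): -10,  # I cooperate, the opponent defects
    (False, True): 0,    # I defect, the opponent cooperates
}

agent1.set_result((cooperating1, cooperating2), PAYOFFS[(cooperating1, cooperating2)])
agent2.set_result((cooperating2, cooperating1), PAYOFFS[(cooperating2, cooperating1)])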

These are the key ideas implemented in the code. Now let's look at some examples of agent classes:

DefectAgent and CooperateAgent are agents with a very simple constant response: the former always defects, the latter always cooperates.

class DefectAgent(Agent):

    def get_decision(self, learning_phase, maturity):
        return False


class CooperateAgent(Agent):

    def get_decision(self, learning_phase, maturity):
        return True

The well-known Tit-for-Tat agent always starts with cooperation. In later steps its decision depends on the previous move of the adversary: if the adversary defected in the last iteration, it defects too; otherwise it cooperates.

class TitForTatAgent(Agent):

    def get_decision(self, learning_phase, maturity):
        if len(self.history)==0:
            return True  # always start with cooperation!
        else:
            return self.history[-1][1]  # repeat the opponent's last decision

Another classic agent is the Grim Trigger agent. Like the Tit-for-Tat agent, it always starts with cooperation. In later steps it only cooperates if the adversary has never defected before.

class GrimTriggerAgent(Agent):

    def get_decision(self, learning_phase, maturity):
        if len(self.history)==0:
            return True  # start with cooperation
        else:
            # if the opponent defected once -> always defect from then on
            cooperate = True
            for h in self.history:
                if not h[1]:  # the opponent defected in this step
                    cooperate = False
                    break
            return cooperate

The more advanced agents, which make this implementation interesting, are based on neural networks. I have tried two different architectures, and surprisingly the simpler one (a plain RNN) seems to work better than the more sophisticated one (an LSTM). All of these agents try to predict the adversary's next move from the history of previous moves and implement different strategies for reacting to this prediction.
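To give an idea of the basic pattern, here is a heavily simplified sketch of such a predictor agent. This is not the code from the repository (class name, layer sizes and the decision rule are illustrative): the opponent's past moves are fed into a small RNN, the network predicts whether the opponent will cooperate next, and the agent reacts to that prediction.

import torch
import torch.nn as nn

class SimpleRNNPredictorAgent(Agent):  # illustrative, not a class from the repo

    def __init__(self):
        super().__init__()
        self.rnn = nn.RNN(input_size=1, hidden_size=8, batch_first=True)
        self.head = nn.Linear(8, 1)
        # used by the online update shown further below
        self.optimizer = torch.optim.Adam(
            list(self.rnn.parameters()) + list(self.head.parameters()), lr=0.01)
        self.loss_fn = nn.BCEWithLogitsLoss()

    def _predict_opponent_cooperates(self):
        # encode the opponent's past moves: 1.0 = cooperate, 0.0 = defect
        moves = torch.tensor([[[1.0 if h[1] else 0.0] for h in self.history]])
        with torch.no_grad():
            _, hidden = self.rnn(moves)
            logit = self.head(hidden[-1])
        return torch.sigmoid(logit).item() > 0.5

    def get_decision(self, learning_phase, maturity):
        if len(self.history) == 0:
            return True  # start optimistically with cooperation
        # simplest possible reaction: mirror the predicted move of the opponent
        return self._predict_opponent_cooperates()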

Due to the nature of the experiments, the learning process cannot use batches, which makes learning rather inefficient. The agent based on tree search (class LookAheadRNNPredictorAgent) is computationally very expensive. To speed up the experiment, it helps to create only a small fraction of such agents.
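To make the "no batches" point concrete: in a sketch like the one above, the predictor is updated on a single new observation after each step, roughly as follows (again illustrative, not the repository code):

    def _online_update(self):
        # train on exactly one sample: predict the opponent's latest move
        # from all moves that came before it (no batching possible)
        if len(self.history) < 2:
            return
        inputs = torch.tensor([[[1.0 if h[1] else 0.0] for h in self.history[:-1]]])
        target = torch.tensor([[1.0 if self.history[-1][1] else 0.0]])
        _, hidden = self.rnn(inputs)
        logit = self.head(hidden[-1])
        loss = self.loss_fn(logit, target)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()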

The complete Python code for this experiment is available for download on my GitHub page (MIT license).

In the next blog post (following soon), I will present some results and insights from my experiments. Stay tuned.


Follow me on Twitter to get informed about new content on this blog.

I don’t like paywalled content. Therefore I have made the content of my blog freely available for everyone. But I would still love to invest much more time into this blog, which means that I need some income from writing. Therefore, if you would like to read articles from me more often and if you can afford $2 once a month, please consider supporting me via Patreon. Every contribution motivates me!