Codenames: Pictures is a fun little board game where you are given the task of communicating which of the cards on the table are “yours”, where each turn you can say only one word and one number.
The goal is to find a strategy that maximizes the information conveyed by the word and number, ideally leading to the correct cards being chosen with minimal ambiguity.
However, we also have to account for the fact that half of the cards belong to the opposing team, and we want to avoid them being chosen. This adds a layer of complexity to our strategy.
We cannot simply discard every feature that is shared with the opposing team, as this would severely limit our options. Instead, we need to find a balance between maximizing the information conveyed about our own cards and minimizing the risk of the opposing team’s cards being chosen.
Information Theory is well suited to analyze this problem. We can use concepts like entropy and mutual information to quantify the amount of information conveyed by a given clue.
Formal Definition
We will assume access to an image classification model that extracts features from the images, which we use to form a probability distribution over the features for each image.
Why a probability distribution? Because we want to be able to quantify the uncertainty associated with each image, and a probability distribution allows us to do that.
Say we have a set of $K$ features containing descriptors:

$$F = \{ f_1, f_2, \dots, f_K \}$$

This set of features can be generated by a pre-trained image classification model.
For each one of the cards $c_i$ in the game, a machine learning model will then give us a probability distribution over the features:

$$P(f_k \mid c_i), \quad \sum_{k=1}^{K} P(f_k \mid c_i) = 1$$
We then define a description/clue $w$ that can be a mix/subset of the features in $F$. We will formalize it as a binary or weighted mask over the features:

$$w \in \{0, 1\}^K \quad \text{or} \quad w \in [0, 1]^K$$
This gives us a mechanism to evaluate the effectiveness of a clue in terms of how well it matches the features of the cards.
The clue $w$ defines a posterior over which cards it might refer to. Using Bayes’ rule:

$$P(c_i \mid w) = \frac{P(w \mid c_i)\, P(c_i)}{\sum_j P(w \mid c_j)\, P(c_j)}$$

and defining $P(w \mid c_i)$ as:

$$P(w \mid c_i) = \sum_{k=1}^{K} w_k \, P(f_k \mid c_i)$$

This defines a “fit” of a clue $w$ to a card $c_i$, i.e. “how likely is it that card $c_i$ would be generated by that description”.
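As a minimal sketch of the fit and posterior computations, assume the model output below (the four cards, three features, and all numbers are hypothetical toy values, not real model output):

```python
import math

# Hypothetical toy setup: 4 cards, 3 features ("animal", "water", "round").
# P_f_given_c[i][k] approximates P(f_k | c_i); each row sums to 1.
P_f_given_c = [
    [0.7, 0.2, 0.1],  # card 0: mostly "animal"
    [0.6, 0.3, 0.1],  # card 1: also animal-ish
    [0.1, 0.8, 0.1],  # card 2: mostly "water"
    [0.2, 0.1, 0.7],  # card 3: mostly "round"
]

def clue_likelihood(w, p_card):
    """P(w | c): the mass of the card's feature distribution selected by mask w."""
    return sum(wk * pk for wk, pk in zip(w, p_card))

def posterior(w, P, prior=None):
    """P(c_i | w) via Bayes' rule, with a uniform prior over cards by default."""
    n = len(P)
    prior = prior or [1.0 / n] * n
    joint = [clue_likelihood(w, P[i]) * prior[i] for i in range(n)]
    z = sum(joint)  # evidence P(w)
    return [j / z for j in joint]

w = [1, 0, 0]                      # binary mask: the clue is the "animal" feature
post = posterior(w, P_f_given_c)   # cards 0 and 1 get most of the posterior mass
```

With this clue, the posterior concentrates on the two animal-like cards, which is exactly the behavior the “fit” is meant to capture.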
Bringing information theory in now, we will evaluate how much a given clue reduces the uncertainty about whether a card is a “target” or an “enemy” (since we don’t want to reveal the opponent’s cards for them).
We define:
- $T$ - the set of our target cards
- $E$ - the set of the enemy’s cards
- $C = T \cup E$ - all candidate cards
We will be disregarding the blank card for this experiment: since it neither works against us nor is a member of our set of cards, its effect is trivial.
We are seeking to find a clue $w$ that maximizes the mutual information between $w$ and “is the card a member of $T$”.
We can start by defining a new discrete random variable $Y$ such that:

$$Y_i = \begin{cases} 1 & \text{if } c_i \in T \\ 0 & \text{if } c_i \in E \end{cases}$$
Thus, the information gained from the clue is defined as:

$$IG(w) = H(Y) - H(Y \mid w)$$
Where:
- $H(Y)$ - the entropy of our target vs. not-target split, i.e. the baseline uncertainty
- $H(Y \mid w)$ - the expected entropy after hearing our clue (how much uncertainty is left over)
$IG(w)$ will be high if the clue makes the targets much more probable than the enemies.
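A small sketch of this information gain, collapsing $Y$ to “posterior mass on targets” (the example posterior and target flags are hypothetical):

```python
import math

def binary_entropy(p):
    """H_b(p) in bits; 0 at the endpoints by convention."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def information_gain(posterior, is_target):
    """IG(w) = H(Y) - H(Y | w), treating Y as 'the referred card is a target'."""
    prior_q = sum(is_target) / len(is_target)                    # baseline P(Y=1)
    post_q = sum(p for p, t in zip(posterior, is_target) if t)   # P(Y=1 | w)
    return binary_entropy(prior_q) - binary_entropy(post_q)

# Hypothetical posterior after a clue, with cards 0 and 1 as our targets.
post = [0.4375, 0.375, 0.0625, 0.125]
targets = [True, True, False, False]
ig = information_gain(post, targets)  # positive: the clue reduced uncertainty
```

A clue that leaves the posterior uniform would give an information gain of zero; the more mass it shifts onto the targets, the closer the gain climbs toward the full baseline entropy of one bit.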
Our Scoring Function
We now need a scoring function to evaluate how good a clue is at maximizing information about our targets while minimizing the risk of enemies being chosen.
Expanding out our definition:

$$H(Y \mid w) = -\,q(w) \log q(w) - \big(1 - q(w)\big) \log\big(1 - q(w)\big), \qquad q(w) = \sum_i y_i \, P(c_i \mid w)$$

Where:
- $P(c_i \mid w)$ - defines the likelihood of the clue $w$ pointing to card $c_i$
- $y_i$ - is 1 if $c_i \in T$, otherwise it is 0
Another, much simpler scoring function can be achieved using the log-likelihood ratio:

$$\mathrm{LLR}(w) = \log \frac{P(w \mid T)}{P(w \mid E)}$$

Where $P(w \mid T)$ is defined as:

$$P(w \mid T) = \frac{1}{|T|} \sum_{c \in T} P(w \mid c)$$

and $P(w \mid E)$ as:

$$P(w \mid E) = \frac{1}{|E|} \sum_{c \in E} P(w \mid c)$$
Essentially, we are measuring how much better the given clue matches the targets than the enemies.
If $\mathrm{LLR}(w) > 0$, then the clue carries a net positive amount of information for the targets, and its magnitude defines the strength of the clue.
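This ratio is straightforward to compute. In the sketch below, the team feature distributions and the two candidate clues are hypothetical toy values:

```python
import math

def clue_likelihood(w, p_card):
    """P(w | c): the mass of the card's feature distribution selected by mask w."""
    return sum(wk * pk for wk, pk in zip(w, p_card))

def log_likelihood_ratio(w, target_dists, enemy_dists):
    """LLR(w) = log P(w|T)/P(w|E), averaging P(w|c) within each team."""
    p_t = sum(clue_likelihood(w, d) for d in target_dists) / len(target_dists)
    p_e = sum(clue_likelihood(w, d) for d in enemy_dists) / len(enemy_dists)
    return math.log(p_t / p_e)

# Hypothetical feature distributions over ("animal", "water", "round").
targets = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]
enemies = [[0.1, 0.8, 0.1], [0.2, 0.1, 0.7]]

good_clue = [1, 0, 0]  # "animal": matches both targets, neither enemy
bad_clue  = [0, 1, 0]  # "water": matches an enemy card
```

Here `log_likelihood_ratio(good_clue, ...)` comes out positive while `log_likelihood_ratio(bad_clue, ...)` is negative, matching the sign convention above.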
Now, the key challenge is to balance our coverage-maximizing algorithm with the target-enemy ratio.

This gives us two important properties: the clue should cover every target, and it should discriminate targets from enemies.

We can use these to assemble a final scoring function:

$$S(w) = \alpha \cdot \mathrm{Coverage}(w) + (1 - \alpha) \cdot \mathrm{LLR}(w)$$

Where Coverage is defined as:

$$\mathrm{Coverage}(w) = \min_{c \in T} P(w \mid c)$$

ensuring all targets are hit, and the log-ratio $\mathrm{LLR}(w)$ is the information gain against enemies.
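Putting the pieces together, a clue can then be chosen by scoring every candidate mask. The weighted-sum combination with `alpha`, the toy distributions, and the one-feature candidate clues below are all assumptions for illustration:

```python
import math

def clue_likelihood(w, p_card):
    """P(w | c): the mass of the card's feature distribution selected by mask w."""
    return sum(wk * pk for wk, pk in zip(w, p_card))

def coverage(w, target_dists):
    """Worst-case fit over the targets: high only if *every* target is hit."""
    return min(clue_likelihood(w, d) for d in target_dists)

def log_likelihood_ratio(w, target_dists, enemy_dists):
    """LLR(w) = log P(w|T)/P(w|E), averaging P(w|c) within each team."""
    p_t = sum(clue_likelihood(w, d) for d in target_dists) / len(target_dists)
    p_e = sum(clue_likelihood(w, d) for d in enemy_dists) / len(enemy_dists)
    return math.log(p_t / p_e)

def score(w, target_dists, enemy_dists, alpha=0.5):
    """Assumed combination: a weighted sum of coverage and log-ratio."""
    return (alpha * coverage(w, target_dists)
            + (1 - alpha) * log_likelihood_ratio(w, target_dists, enemy_dists))

# Pick the best clue from a hypothetical candidate set of one-feature masks.
targets = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]
enemies = [[0.1, 0.8, 0.1], [0.2, 0.1, 0.7]]
candidates = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
best = max(candidates, key=lambda w: score(w, targets, enemies))
```

The `min` in Coverage is what enforces the “all targets are hit” property: a clue that matches one target strongly but misses another scores no better than its weakest match.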