Notes on a Normalized Nerd video

Markov chains are used everywhere: in statistics, biology, and of course machine learning. They also give rise to some very interesting probability theory.

Let’s start with an example.


Imagine there is a restaurant that serves 3 different kinds of food: burger, pizza, and hotdog. However, they have a weird rule: they only serve one of these 3 items on any given day, and which one depends on what they served yesterday.

Essentially, if you know the probabilities and what they served today, you could predict what they will serve tomorrow.

Let’s say we are given a graph with all 3 foods as nodes, where each arrow points from one food to another and carries a number between 0 and 1: the probability of that transition.
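The exact numbers are not reproduced here, so the worked examples below assume the edge weights from the video's diagram (treat these specific values as an assumption):

- Burger → Burger: 0.2, Burger → Pizza: 0.6, Burger → Hotdog: 0.2
- Pizza → Burger: 0.3, Pizza → Hotdog: 0.7
- Hotdog → Burger: 0.5, Hotdog → Hotdog: 0.5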

If we are on pizza today, we can read off the weight of the arrow from pizza to hotdog to find the chance of having hotdog tomorrow.

While quite a simple diagram, this is actually a complete Markov chain!

In more formal math notation, we can represent the probability of a certain food choice on a given day considering only what was had yesterday:
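Writing $X_n$ for the food served on day $n$, the property reads:

$$P(X_{n+1} = x \mid X_n = x_n, X_{n-1} = x_{n-1}, \ldots, X_0 = x_0) = P(X_{n+1} = x \mid X_n = x_n)$$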

This is the essential property of Markov chains: you only need the previous state to figure out the next, and not the whole history/probability distribution.

Example

For example, finding the probability of getting hotdog tomorrow given that pizza was served today:
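Using the assumed edge weights above, this is simply the weight of the arrow from pizza to hotdog:

$$P(X_{n+1} = \text{Hotdog} \mid X_n = \text{Pizza}) = 0.7$$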

This is known as the Markov Property.

The other important property is that the sum of the weights of the outgoing arrows from any state is equal to 1. This makes sense: the outgoing arrows cover every possible next state, so together they must account for the total probability of 1, and if they don't add up to it, something is wrong.
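For instance, checking the pizza state with the assumed weights:

$$P(\text{Burger} \mid \text{Pizza}) + P(\text{Hotdog} \mid \text{Pizza}) = 0.3 + 0.7 = 1$$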

However, there are special Markov chains with unique additional properties, which we will discuss later.


Exploring the Probability Distribution

To get a feel for what you might see over a long run of this chain, let's take a random walk on it and see what we get:

After 10 steps, we are left with a short sequence of foods. But how can we find the probability of each food item, i.e. a probability distribution? On its own this is not a particularly deep question, but it leads to a bigger-picture idea later.

Well, for each food we divide its number of occurrences by the total number of steps in the walk.
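In symbols, the empirical probability of each food is:

$$\hat{P}(\text{food}) = \frac{\text{number of occurrences of that food}}{\text{total number of steps}}$$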

The more interesting question, though, is: what do these probabilities converge to as the walk gets longer? In other words, how does this Markov graph translate into a traditional probability distribution over the states?

Let’s approach this with a brute-force Python implementation (a minimal sketch, assuming the edge weights listed above):

Python
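import random

# Assumed transition probabilities (taken to match the video's diagram;
# treat these specific numbers as an assumption)
transitions = {
    "burger": (["burger", "pizza", "hotdog"], [0.2, 0.6, 0.2]),
    "pizza": (["burger", "hotdog"], [0.3, 0.7]),
    "hotdog": (["burger", "hotdog"], [0.5, 0.5]),
}

n_steps = 1_000_000
counts = {state: 0 for state in transitions}
current = "pizza"  # arbitrary starting state

# Walk the chain, counting how often each state is visited
for _ in range(n_steps):
    options, weights = transitions[current]
    current = random.choices(options, weights=weights)[0]
    counts[current] += 1

for state, count in counts.items():
    print(f"{state}: {count / n_steps:.3f}")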
Output
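burger: 0.352
pizza: 0.211
hotdog: 0.437

(Representative values for the assumed weights; the exact digits vary from run to run.)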

Running this, we find that the frequencies of the items converge to some interesting and quite specific values.

After this point, the distribution reaches a stationary state, meaning it will no longer change with time. While this works, it is not a very efficient way to compute the distribution, and it leaves open the question of whether there is a more mathematical approach. We also don't know whether there are any other stationary states.

Well, there is a better way to represent it, through the use of linear algebra.


Using Matrices for Markov Representations

In reality, a Markov chain is essentially just a weighted directed graph, something we are very familiar with as computer scientists (from graph databases, and from automata in the theory of computation).

Because of this, we can represent the previous graph with a simple weighted adjacency matrix, shown below.

Here the rows represent the initial state, and the columns represent the destination state (this orientation is what lets us pull out a row by multiplying with a row vector on the left, as we do below).

In linear algebra, we would represent this as a transition matrix $A$:
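With the assumed weights and the state order burger, pizza, hotdog:

$$A = \begin{pmatrix} 0.2 & 0.6 & 0.2 \\ 0.3 & 0 & 0.7 \\ 0.5 & 0 & 0.5 \end{pmatrix}$$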

We will then use a row vector to represent the probability distribution over the states. As the Markov chain progresses through time, this vector will approach values like the ones we found with the Python approach.

If we begin on a pizza day, the initial state vector would look like:
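$$\pi_0 = \begin{pmatrix} 0 & 1 & 0 \end{pmatrix}$$

(A 1 in the pizza slot, and 0 everywhere else.)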

Something interesting happens when we multiply this state vector by the transition matrix:
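$$\pi_1 = \pi_0 A = \begin{pmatrix} 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} 0.2 & 0.6 & 0.2 \\ 0.3 & 0 & 0.7 \\ 0.5 & 0 & 0.5 \end{pmatrix} = \begin{pmatrix} 0.3 & 0 & 0.7 \end{pmatrix}$$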

We have extracted the second row of the transition matrix, which happens to represent the probabilities of future foods given it is a pizza day.

Now, if we take this resulting vector and use it in place of the initial one:
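$$\pi_2 = \pi_1 A = \begin{pmatrix} 0.3 & 0 & 0.7 \end{pmatrix} A = \begin{pmatrix} 0.41 & 0.18 & 0.41 \end{pmatrix}$$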

We get another new distribution. What does it mean? Let's repeat this a few more times for the next few iterations of $\pi_n$:
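Continuing with the assumed matrix:

$$\pi_3 \approx \begin{pmatrix} 0.341 & 0.246 & 0.413 \end{pmatrix}, \quad \pi_4 \approx \begin{pmatrix} 0.349 & 0.205 & 0.447 \end{pmatrix}, \quad \pi_5 \approx \begin{pmatrix} 0.355 & 0.209 & 0.436 \end{pmatrix}$$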

Notice how this seems to be getting closer and closer to the stationary state? That's because it is!

If there exists a stationary state for this initial choice, then after repeating the process enough times, the resulting vector will converge on a stationary value. Eventually, the output vector will be identical to the input vector.

Denoting this special row vector as $\pi$, we can write (for a converging stationary state):
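$$\pi A = \pi$$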

As Linear Algebra students, we will recognize this as similar to the eigenvector equation:
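$$A v = \lambda v$$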

Just by setting $\lambda = 1$ and reversing the order of multiplication (a row vector on the left instead of a column vector on the right), we get our equilibrium state equation.

How do we interpret this? We treat $\pi$ as a left eigenvector of the matrix $A$, with eigenvalue equal to 1.

The eigenvector in this approach must also satisfy another condition: the elements of $\pi$ must add up to 1, i.e. $\sum_i \pi_i = 1$ (as it is a probability distribution).

After solving these two equations, we are left with the finalized stationary state:
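With the assumed matrix, this works out to:

$$\pi = \begin{pmatrix} \frac{25}{71} & \frac{15}{71} & \frac{31}{71} \end{pmatrix}$$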

Converting to decimal:
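$$\pi \approx \begin{pmatrix} 0.352 & 0.211 & 0.437 \end{pmatrix}$$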

Very similar to our brute force approach!

Using this technique, we can also find out whether there is more than one stationary state. This can be done, as you might expect, by checking whether there is more than one eigenvector with eigenvalue 1! Very nice.
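To close the loop, here is a minimal sketch of the eigenvector approach in code, again assuming the transition matrix from above:

Python

import numpy as np

# Assumed transition matrix (rows = current state, columns = next state)
# State order: burger, pizza, hotdog
A = np.array([
    [0.2, 0.6, 0.2],
    [0.3, 0.0, 0.7],
    [0.5, 0.0, 0.5],
])

# Left eigenvectors of A are the right eigenvectors of A transposed
eigenvalues, eigenvectors = np.linalg.eig(A.T)

# Keep the eigenvectors whose eigenvalue is 1 (up to floating-point error)
for i, value in enumerate(eigenvalues):
    if np.isclose(value, 1.0):
        vector = eigenvectors[:, i].real
        pi = vector / vector.sum()  # normalize so the elements add up to 1
        print(pi)  # approximately [0.352, 0.211, 0.437]

If the loop prints more than one vector, the chain has more than one stationary state.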