The Fundamental Theorem of Markov Chains (Proof, Intuition + Uniqueness)

Tags: Markov Chains, Information Theory, Data Science, Linear Algebra, Probability, Stationary Distribution

Explore the Fundamental Theorem of Markov Chains with a rigorous proof of stationary distribution existence and uniqueness using linear algebra and convex combinations.

References: Foundations of Data Science by A. Blum, J. Hopcroft, and R. Kannan

This article explains the fundamental theorem of Markov chains, including the existence and uniqueness of stationary distributions. We walk through the formal proof using linear algebra concepts such as rank, null space, and convex combinations.

Stationary distribution

Let p(t) be the probability distribution at time t. Then

p(t+1) = p(t)P

The long-term average probability distribution is given by

a(t) = \frac{1}{t}\big(p(0) + p(1) + \dots + p(t-1)\big)
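As a quick numerical sketch, the evolution p(t+1) = p(t)P and the running average a(t) can be computed directly (the 3-state matrix P below is an illustrative assumption, not from the text):

```python
import numpy as np

# Illustrative 3-state transition matrix (an assumption; rows sum to 1)
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.4, 0.4]])

p = np.array([1.0, 0.0, 0.0])  # p(0): start in state 1
history = [p]
for _ in range(1000):
    p = p @ P                  # p(t+1) = p(t) P
    history.append(p)

# Long-term average a(t) = (p(0) + p(1) + ... + p(t-1)) / t, here t = 1000
a = np.mean(history[:-1], axis=0)
print(a)                       # close to the stationary distribution
```

Averaging smooths out any oscillation in p(t) itself, which is why the theorem below is stated in terms of a(t).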

The Fundamental Theorem of Markov Chains

Theorem 1.0

Let P be a regular transition matrix of a finite-state Markov chain. Then there exists a unique stationary distribution π such that

\pi P = \pi

Before moving to the theorem's proof, we shall discuss an important lemma.

Lemma 1.1

Let P be the transition probability matrix of a connected Markov chain. Then the n × (n+1) matrix A = [P − I, 1], obtained by augmenting the matrix P − I with an additional column of ones, has rank n.
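Before the proof, a numerical sanity check of the lemma (the 3-state matrix P is an illustrative assumption):

```python
import numpy as np

# Illustrative connected 3-state chain (rows sum to 1)
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.4, 0.4]])
n = P.shape[0]

# A = [P - I, 1]: P - I augmented with a column of ones -> shape n x (n+1)
A = np.hstack([P - np.eye(n), np.ones((n, 1))])
print(A.shape)                   # (3, 4)
print(np.linalg.matrix_rank(A))  # 3, i.e. rank n as the lemma asserts
```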

Proof:

We shall approach the proof by contradiction.

If the rank of the matrix A were less than n, say n − 1, then the null space of A would be at least 2-dimensional.

By the rank-nullity theorem,

\text{rank}(A) + \text{nullity}(A) = \text{no. of columns of } A

Since the rank of A is assumed to be n − 1, and the number of columns of A is n + 1,

(n-1) + \text{nullity}(A) = n+1

\text{nullity}(A) = 2

This means that there are at least 2 independent vectors in the null space of A. Let us find those vectors. We can immediately see that the vector [1, 0] = [1, 1, …, 1, 0]^T is in the null space of A: each row of P sums to 1 (each row is a probability distribution), and the −I in A subtracts 1 from each row, so every row of P − I sums to 0; the final entry of the vector is 0, so the column of ones contributes nothing. Hence the product of A and [1, 0] is the zero vector.

\left[ \begin{array}{cccc|c} p_{11} - 1 & p_{12} & \cdots & p_{1n} & 1 \\ p_{21} & p_{22} - 1 & \cdots & p_{2n} & 1 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ p_{n1} & p_{n2} & \cdots & p_{nn} - 1 & 1 \end{array} \right] \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}
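A quick numerical check of this product, again with an illustrative 3-state P:

```python
import numpy as np

# Illustrative 3-state stochastic matrix (rows sum to 1)
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.4, 0.4]])
n = P.shape[0]
A = np.hstack([P - np.eye(n), np.ones((n, 1))])

# Multiply A by [1, 1, ..., 1, 0]^T: each row of P - I sums to 0,
# and the trailing 0 entry kills the column of ones
v = np.append(np.ones(n), 0.0)
print(A @ v)   # ~ [0, 0, 0]
```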

Now that we are done with one vector in the null space of A, we shall find the other. We may take it to be perpendicular to the vector [1, 1, …, 1, 0]^T used above, since a 2-dimensional null space always contains a vector orthogonal to the first; exhibiting two such independent vectors is exactly what our assumption requires. Let us call it [x, α]^T.

There must be some negative values in x: perpendicularity to [1, 1, …, 1, 0]^T means the dot product ∑_{i=1}^n x_i is 0, which can happen only if negative entries of x balance the positive ones. (Note that x cannot be the zero vector, for then A[x, α]^T = α·[1, …, 1]^T = 0 would force α = 0 as well.)

If we multiply the matrix A by the vector [x, α]^T, we get the following:

\left[ \begin{array}{cccc|c} p_{11} - 1 & p_{12} & \cdots & p_{1n} & 1 \\ p_{21} & p_{22} - 1 & \cdots & p_{2n} & 1 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ p_{n1} & p_{n2} & \cdots & p_{nn} - 1 & 1 \end{array} \right] \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \\ \alpha \end{bmatrix} = \begin{bmatrix} \vdots \\ \sum_{j=1}^n p_{ij} x_j - x_i + \alpha \\ \vdots \end{bmatrix}

The extra −x_i comes from subtracting the identity matrix I from P. Since [x, α]^T lies in the null space, every entry of this product is 0, giving x_i = ∑_{j=1}^n p_{ij} x_j + α: each x_i is a convex combination of the x_j, with weights taken from the i-th row of P, plus α.

Convex Combination

The convex combination of n objects o_1, o_2, …, o_n is defined as

\sum_{i=1}^n \alpha_i o_i

where α_i ≥ 0 for all i and ∑_{i=1}^n α_i = 1.

Here the weights are none other than the entries p_{ij} of the i-th row of P (not to be confused with the scalar α above), which are nonnegative and sum to 1.
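For instance, taking hypothetical weights (0.1, 0.6, 0.3), as might appear in one row of P, a convex combination always lands between the smallest and largest value combined:

```python
import numpy as np

# Hypothetical weights: one row of a stochastic matrix (nonnegative, sum to 1)
weights = np.array([0.1, 0.6, 0.3])
x = np.array([4.0, -2.0, 5.0])   # arbitrary values being combined

convex_comb = weights @ x        # sum_j alpha_j * x_j
print(convex_comb)               # ~0.7, between min(x) and max(x)
```

This "stays between min and max" property is exactly what the maximal/minimal argument below exploits.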

Let S be the set of indices i for which x_i attains the maximum value, and let S̄ be the complementary set of indices with smaller values. Connectedness of the Markov chain implies that some maximal x_k is adjacent (p_{kl} > 0) to some strictly smaller x_l. From this, we can say that

x_k > \sum_{j=1}^n p_{kj} x_j

This is because a weighted average (convex combination) that puts positive weight on a strictly smaller neighbouring value must be strictly less than the maximal value x_k.

But since [x, α]^T lies in the null space of A, we also have

x_k = \sum_{j=1}^n p_{kj} x_j + \alpha

and together these force α to be greater than 0.

If we turn the tables and do the same for a minimal value x_l adjacent to a strictly larger one, then

x_l < \sum_{j=1}^n p_{lj} x_j

which, combined with the null-space equation x_l = \sum_{j=1}^n p_{lj} x_j + \alpha, forces α to be less than 0.

But α cannot be both greater than and less than 0. This contradiction invalidates our assumption that the null space of A is 2-dimensional, i.e. that the rank of A is n − 1.

Hence, we can say that the rank of the matrix AA is nn.


Let us come back to the theorem.

Theorem 1.0

Let P be a regular transition matrix of a finite-state Markov chain. Then there exists a unique stationary distribution π such that πP = π. Moreover, for any starting distribution, lim_{t→∞} a(t) = π.

Here we need to note that a(t) is the long-term average probability distribution, not the probability distribution at time t.

Let us run one step of the Markov chain starting with distribution a(t): the distribution after the step is a(t)P. Let us calculate the change in probabilities due to this step.

\begin{aligned} a(t)P - a(t) &= \frac{1}{t}\big[p(0)P + p(1)P + \cdots + p(t-1)P\big] - \frac{1}{t}\big[p(0) + p(1) + \cdots + p(t-1)\big] \\ &= \frac{1}{t}\big[p(1) + p(2) + \cdots + p(t)\big] - \frac{1}{t}\big[p(0) + p(1) + \cdots + p(t-1)\big] \\ &= \frac{1}{t}\big(p(t) - p(0)\big). \end{aligned}

Thus b(t) = a(t)P − a(t) = (1/t)(p(t) − p(0)) measures how far a(t) is from being stationary. Since p(t) and p(0) are both probability vectors, the triangle inequality gives |b(t)| ≤ 2/t for all t ≥ 1, so b(t) → 0 as t → ∞.
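This identity and the 2/t bound can be checked numerically (once more with an illustrative 3-state P):

```python
import numpy as np

# Illustrative 3-state chain; p(0) starts in state 1
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.4, 0.4]])
t = 50
p0 = np.array([1.0, 0.0, 0.0])

ps = [p0]                          # collect p(0), ..., p(t)
for _ in range(t):
    ps.append(ps[-1] @ P)

a_t = sum(ps[:t]) / t              # a(t) = (p(0) + ... + p(t-1)) / t
b_t = a_t @ P - a_t

print(np.allclose(b_t, (ps[t] - p0) / t))  # b(t) = (p(t) - p(0)) / t
print(np.abs(b_t).sum() <= 2 / t)          # L1 norm bounded by 2/t
```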

By Lemma 1.1, the rank of the matrix A = [P − I, 1] is n. Let us also give a closed-form solution for π. Take the n × (n+1) matrix A and remove its first column to obtain an n × n submatrix, which we call B. B is invertible: if it were not, the rank of A would be less than n, contradicting Lemma 1.1. Let c(t) be obtained by removing the first entry of b(t). Since a(t)P − a(t) = b(t) and the entries of a(t) sum to 1, we have a(t)A = [b(t), 1]; dropping the first column gives

a(t)B = [c(t), 1]

Then

a(t) = [c(t), 1]B^{-1} \to [\mathbf{0}, 1]B^{-1}

establishing the theorem with π = [\mathbf{0}, 1]B^{-1}. Uniqueness also follows: any stationary distribution π satisfies πA = [\mathbf{0}, 1], hence πB = [\mathbf{0}, 1], and since B is invertible this π is the only solution.
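Putting it all together, a sketch of the closed form π = [0, 1]B^{-1} (the 3-state P is an illustrative assumption, and we solve a linear system rather than inverting B explicitly):

```python
import numpy as np

# Illustrative 3-state chain (an assumption, as above)
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.4, 0.4]])
n = P.shape[0]

# B: A = [P - I, 1] with its first column removed (n x n, invertible by Lemma 1.1)
A = np.hstack([P - np.eye(n), np.ones((n, 1))])
B = A[:, 1:]

# pi = [0, ..., 0, 1] B^{-1}, computed by solving pi B = [0, ..., 0, 1]
e = np.zeros(n)
e[-1] = 1.0
pi = np.linalg.solve(B.T, e)

print(pi)                       # the stationary distribution
print(np.allclose(pi @ P, pi))  # True: pi P = pi
```

The same π should match the long-run average a(t) computed by simulation earlier, which is the content of the theorem.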