What is Ergodic Theory? In this post, I present some wonderful and intuitive proofs for the ergodic theorems and describe two applications of the ergodic theorem.
March 18, 2023 · 16 min · 3361 words · Eitan Porat
Suppose you want to model the dynamics of a gas inside a box. We model the gas as a bunch of molecules, which can hit each other and the box they are enclosed in. We consider the situation in which there is no loss of energy: collisions are perfectly elastic, and there is no friction or heat dissipation.
We can write physical equations of motion to model this gas. The problem is that the number of molecules is enormous, so we cannot describe the position and momentum of each particle, nor can we solve these equations efficiently. What we can do is observe some macroscopic physical quantity of the system (such as temperature, heat capacity, etc.).
In mathematical terms, we have $N$ particles with positions $q_1, q_2, \dots, q_N$ (each a 3-dimensional vector) and momenta $p_1, p_2, \dots, p_N$ (also 3-dimensional vectors). We can collect all variables into a single variable $x = (q_1, \dots, q_N, p_1, \dots, p_N)$, the state of the system. What we are interested in measuring is $f(x)$. But since molecules move very quickly ($\approx 500\,\mathrm{m/s}$), we assume that the period it takes to take a physical measurement is actually a long time for the molecules. That is, assuming that the system evolves in time $t$ as $P^t(x)$, we are actually measuring $$\frac{1}{\tau}\int_0^\tau f(P^t(x))\,dt,$$ where $\tau$ is the period of time it takes to take a physical measurement. But as we assumed $\tau$ is large (since each molecule moves so quickly), it is safe to assume we are measuring $$\lim_{\tau\to\infty}\frac{1}{\tau}\int_0^\tau f(P^t(x))\,dt.$$ Suppose the system has a fixed energy $C$. The ergodic theorem states that $$\lim_{\tau\to\infty}\frac{1}{\tau}\int_0^\tau f(P^t(x))\,dt = \int_X f(x)\,d\mu,$$ where $X = \{x \mid E(x) = C\}$ is a (compact) manifold of states with energy $C$ and $\mu$ is the uniform measure (say Borel) over all states in the manifold.
Intuitively, this means that taking a time-average measurement of a physical quantity is equivalent to an average of the physical quantity over all possible states.
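To make this concrete, here is a minimal numerical sketch (my own illustration, not the gas system): the logistic map $T(x) = 4x(1-x)$ on $[0,1]$ is a standard example of an ergodic map, with invariant (arcsine) measure $d\mu = \frac{dx}{\pi\sqrt{x(1-x)}}$ whose mean is $1/2$, so the time average of the observable $f(x) = x$ along a typical orbit should approach the phase average $1/2$.

```python
import random

def time_average(T, f, x0, n_steps):
    """Average f along the orbit x0, T(x0), ..., T^(n_steps-1)(x0)."""
    x, total = x0, 0.0
    for _ in range(n_steps):
        total += f(x)
        x = T(x)
    return total / n_steps

# Toy ergodic system: the logistic map on [0, 1].
T = lambda x: 4.0 * x * (1.0 - x)
f = lambda x: x  # the observable we "measure"

x0 = random.random()  # a "typical" starting state
print(time_average(T, f, x0, 10**6))  # ~0.5, the mean of the invariant (arcsine) measure
```

Floating-point iteration of a chaotic map is of course only a heuristic stand-in for the true orbit, but the statistics come out right.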
At this point we move from a physical understanding of the subject to a mathematical perspective: we are trying to capture the essential conditions under which the ergodic theorem is true, and to develop a useful framework that will allow us to generalize even further.
In general, we are given some space $X$ (a set) with some transformation $T$ (remember the time-evolution operator $P^t$ we saw before). What might this transformation be in general?
Consider an ensemble of particles, each with position $x$ and momentum $p$. The area of the blue pixels (the phase-space distribution) is constant along the time evolution of the system. This result is known as Liouville’s theorem.
Ah-ha! This transformation must preserve the volume of the phase space (the space of positions and momenta).
To capture this notion, we define a measure $\mu$ to be a function which takes as input sets and returns values in $[0,1]$. This function has some other properties, such as $\mu(\emptyset) = 0$, $\mu(X) = 1$, and $\mu\left(\bigcup_i A_i\right) = \sum_i \mu(A_i)$ for a countable collection of disjoint sets $\{A_1, A_2, \dots\}$. Click here for a more rigorous definition.
The measure needs to be $T$-preserving in the sense that for every measurable set $E$ we have $\mu(E) = \mu(T^{-1}E)$.
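As a quick sanity check of this definition (my own sketch, not from the post), we can estimate $\mu(E)$ and $\mu(T^{-1}E)$ by Monte Carlo for the circle rotation $T(x) = x + \alpha \bmod 1$ with the uniform measure on $[0,1)$; since $\{x \mid T(x)\in E\} = T^{-1}E$, the two estimates should agree.

```python
import math
import random

alpha = math.sqrt(2) % 1.0                # rotation angle (any alpha preserves the uniform measure)
T = lambda x: (x + alpha) % 1.0

def estimate_measure(indicator, n_samples=10**6):
    """Monte Carlo estimate of the uniform measure of a subset of [0, 1)."""
    return sum(indicator(random.random()) for _ in range(n_samples)) / n_samples

in_E = lambda x: 0.2 <= x < 0.5           # the set E = [0.2, 0.5), mu(E) = 0.3
in_preimage_E = lambda x: in_E(T(x))      # x lies in T^{-1}E iff T(x) lies in E

print(estimate_measure(in_E))             # ~0.3 = mu(E)
print(estimate_measure(in_preimage_E))    # ~0.3 = mu(T^{-1}E)
```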
A not very important aside… the reason we write $T^{-1}$ and not $T$ is technical ($T$ might not be invertible in general).
A system is said to be ergodic if there are no stationary sets except for sets of zero measure or complements of sets of zero measure, i.e. if $T^{-1}(E) = E$ then $\mu(E) = 0$ or $\mu(X\setminus E) = 0$. In other words, a set can be stationary only if it is “almost everything” or “almost nothing”.
We can imagine that the gas molecules are never confined to one specific part of the phase space; for example, smoke released in a room eventually fills the whole room. It is worth noting that for most physical systems we don’t know whether they are ergodic or not, and we usually just assume that this is true.
Proof of the Mean-Ergodic Theorem (von Neumann '31)#
We will start by proving the mean ergodic theorem, which is an easier variant to prove than the one stated earlier.
For the proof of the mean ergodic theorem we only consider functions which are in $L^2$, meaning that $\int_X |f(x)|^2\,d\mu < \infty$. These functions form a Hilbert space, which is useful because we can apply some of our intuitions from linear algebra to these sorts of spaces, such as the notion of the inner product $\langle f, g\rangle = \int_X f(x)g(x)\,d\mu$. We will prove a discrete version of the mean ergodic theorem, namely
$$\frac{1}{N}\sum_{i=1}^{N} f\circ T^{i-1} \xrightarrow[\;N\to\infty\;]{L^2} \int_X f\,d\mu.$$ It is important to note that this is not pointwise convergence but rather convergence in $L^2$ (which is a less strict requirement).
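To get a feel for what convergence in $L^2$ means here, the following sketch (my own illustration, using the irrational rotation that appears later in the post) estimates the norm $\left\|\frac{1}{N}\sum_{i=1}^{N} f\circ T^{i-1} - \int_X f\,d\mu\right\|_2$ by Monte Carlo over the starting point $x$, and watches it shrink as $N$ grows.

```python
import math
import random

alpha = math.sqrt(2) % 1.0
T = lambda x: (x + alpha) % 1.0
f = lambda x: math.sin(2 * math.pi * x)   # observable with phase average 0

def ergodic_average(x, N):
    """(1/N) * sum_{i=1}^{N} f(T^{i-1} x)."""
    total = 0.0
    for _ in range(N):
        total += f(x)
        x = T(x)
    return total / N

def l2_error(N, n_samples=2000):
    """Monte Carlo estimate of || (1/N) sum f.T^{i-1} - integral of f ||_2 (the integral is 0 here)."""
    sq_sum = sum(ergodic_average(random.random(), N) ** 2 for _ in range(n_samples))
    return math.sqrt(sq_sum / n_samples)

for N in (1, 10, 100, 1000):
    print(N, l2_error(N))                 # the L2 error decreases toward 0 as N grows
```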
An important lemma - invariant functions are constant (almost everywhere)!#
In ergodic systems, $T$-invariant (a.e.) functions are constant (a.e.). Let $f$ be a function such that $f = f\circ T$ (a.e.). Consider the set $[f > t]$, that is, $\{x \mid f(x) > t\}$. The preimage of $[f > t]$ under $T$ is $$T^{-1}[f > t] = \{x \mid f\circ T(x) > t\} = \{x \mid f(x) > t\} = [f > t].$$
By ergodicity, $\mu([f > t]) = 0$ or $\mu([f > t]) = 1$. Moreover, $\mu([f > t])$ is monotonically decreasing in $t$, so there must be some critical value $t^*$ for which $\mu([f > t]) = 0$ for every $t > t^*$ but $\mu([f > t]) = 1$ for every $t < t^*$; hence $f$ is constant and equal to $t^*$ (up to a set of zero measure).
A trivial case to consider is when $f = f\circ T$, that is, when $f$ is $T$-invariant; we know that in this case $f$ is constant. It is then obvious that $$\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N} f(T^{i-1}x) = f(x) = \int_X f(x)\,d\mu.$$
Insight 1: the time average of a $T$-invariant function $f$ converges to the function $f$ itself.
We can think of the invariant functions as the kernel of $\mathrm{id} - U$, where $U(f) = f\circ T$ is an operator acting on functions. It seems natural to also consider the image of $\mathrm{id} - U$, that is, functions of the form $g - g\circ T$ for $g\in L^2$; we will call this subspace $C$ (its elements are also called coboundary functions). The time average of the function $g - g\circ T$ is a telescoping sum: $$\frac{1}{N}\sum_{i=1}^{N}\left(g(T^{i-1}x) - g(T^{i}x)\right) = \frac{1}{N}\left(g(x) - g(T^{N}x)\right),$$ so $$\frac{1}{N}\left\|g(x) - g(T^{N}x)\right\|_2 \le \frac{2}{N}\|g\|_2 \xrightarrow[\;N\to\infty\;]{} 0.$$
So the time average of $g - g\circ T$ converges in $L^2$ to $0$. This makes sense, since measurements of observed physical quantities don’t change drastically over short time steps.
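A tiny numerical check of the telescoping identity (again my own illustration, with $T$ an irrational rotation and $g(x) = \sin(2\pi x)$): summing the coboundary $g - g\circ T$ along the orbit term by term gives exactly $\frac{1}{N}\left(g(x) - g(T^{N}x)\right)$, which is $O(1/N)$.

```python
import math

alpha = math.sqrt(3) % 1.0
T = lambda x: (x + alpha) % 1.0
g = lambda x: math.sin(2 * math.pi * x)
h = lambda x: g(x) - g(T(x))              # a coboundary: h = g - g.T

x0, N = 0.123, 10**5

# time average of h, computed term by term along the orbit
x, total = x0, 0.0
for _ in range(N):
    total += h(x)
    x = T(x)
lhs = total / N

rhs = (g(x0) - g(x)) / N                  # closed form of the telescoping sum; x is now T^N(x0)
print(lhs, rhs)                           # equal up to floating-point error, both ~0
```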
Insight 2: functions in $L^2$ can be decomposed into the sum of an invariant function and a coboundary function.
Technical comment (not important…)
It is easy to show that this also holds for functions in the closure $\bar{C}$ of $C$. We now want to show that $L^2$ can be decomposed as $\{\text{invariant functions}\} + \bar{C}$. The reason we care about the closure is that $C$ is not necessarily closed.
Obviously, a function which is both invariant and in $C$ must be the zero function, so the two subspaces intersect only at zero.
Assume that there exists a non-zero function $f$ which is orthogonal to $\{\text{invariant functions}\} + \bar{C}$. In particular, $f$ is orthogonal to $f - f\circ T$. Now $$\|f - f\circ T\|_2^2 = \|f\|_2^2 + \|f\circ T\|_2^2 - 2\langle f, f\circ T\rangle;$$ since $T$ is measure-preserving, it follows from a simple argument about simple functions that $\|f\|_2^2 = \|f\circ T\|_2^2$. Plugging back in, we see that $$\|f - f\circ T\|_2^2 = 2\langle f, f - f\circ T\rangle = 0.$$ Thus $f = f\circ T$, which by definition means that $f$ is invariant, contrary to our assumption that $f$ is orthogonal to the invariant functions.
This means that any function $f$ in $L^2$ can be written as the sum of two functions: one which is invariant and remains constant after computing its time average, and another whose time average converges in $L^2$ norm to $0$. Thus, the time average of $f$ converges in $L^2$ to an invariant function, say $\bar{f}$.
As we saw previously in the lemma, this function must be constant. Which constant?
Proving that the time-average and phase-average are equal#
Since $\bar{f}$ is constant a.e., $\bar{f} = \int_X \bar{f}\,d\mu$. An alternative way to compute this integral is as the inner product $\langle\bar{f}, 1\rangle$. Combining this with the fact that $\frac{1}{N}\sum_{i=1}^{N} f\circ T^{i-1} \longrightarrow \bar{f}$, we see that $$\bar{f} = \left\langle \lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N} f\circ T^{i-1},\; 1\right\rangle,$$ and by linearity of the inner product $$\bar{f} = \lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N}\left\langle f\circ T^{i-1}, 1\right\rangle.$$ Using the fact that $T$ is measure-preserving and therefore preserves inner products, we get our desired result: $$\frac{1}{N}\sum_{i=1}^{N}\int_X f(T^{i-1}x)\,d\mu = \int_X f\,d\mu = \bar{f}. \qquad\blacksquare$$
Another intuition is that we might think of $\bar{f}$ as the projection of $f$ onto the space of invariant functions (which are constant almost everywhere): $$\bar{f} = \underset{\alpha\in\mathbb{R}}{\arg\min}\;\|f - \alpha\|_2,$$ and the constant which minimizes this $L^2$ loss is precisely the mean of $f$, namely $\int_X f\,d\mu$.
One year after von Neumann proved the mean ergodic theorem, Birkhoff discovered a similar result. It states that for a function $f$ in $L^1$ (i.e., $\int_X |f(x)|\,d\mu < \infty$), $$\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N} f(T^{i-1}x) = \int_X f(x)\,d\mu$$ almost everywhere. Notice that convergence here is pointwise.
In Kolmogorov’s interpretation of probability, random variables are essentially functions $f_i: X\to\mathbb{R}$. The law of large numbers states that for a collection of independent, identically distributed variables $f_1, f_2, \dots, f_N$ (with the same distribution as $f$), their mean converges to the expectation: $$\frac{f_1 + f_2 + \cdots + f_N}{N} \longrightarrow \mathbb{E}[f].$$ We can view ergodicity as a condition for generating “random” behaviour. Birkhoff’s theorem shows us that ergodicity gives another estimate for the expectation of $f$.
So evaluating $\mathbb{E}[f] = \int_X f(x)\,d\mu$ is equivalent to picking $x\in X$ arbitrarily (almost any $x$ will do) and computing $$\frac{f(x) + f(T(x)) + \cdots + f(T^{N-1}(x))}{N};$$ notice how this expression is totally deterministic, whereas sampling i.i.d. variables requires us to assume we have access to some source of randomness.
In some sense, ergodic systems give rise to “random” behaviour.
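Here is a small sketch contrasting the two estimators (my own illustration): we estimate $\mathbb{E}[f] = \int_0^1 x^2\,dx = 1/3$ once with i.i.d. uniform samples and once with the completely deterministic orbit of an irrational rotation.

```python
import math
import random

f = lambda x: x * x                       # E[f] = 1/3 under the uniform measure on [0, 1]
N = 10**6

# (1) the usual Monte Carlo estimate: i.i.d. uniform samples
iid_estimate = sum(f(random.random()) for _ in range(N)) / N

# (2) the "ergodic" estimate: a deterministic orbit of the irrational rotation
alpha = math.sqrt(2) % 1.0
x, total = 0.0, 0.0
for _ in range(N):
    total += f(x)
    x = (x + alpha) % 1.0
ergodic_estimate = total / N

print(iid_estimate, ergodic_estimate)     # both ~1/3
```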
Any function $f\in L^1$ may be decomposed as $f = f^+ - f^-$, where $f^+ = \max(f, 0)$ and $f^- = \max(-f, 0)$ are each non-negative. If we prove that the theorem holds for $f^+$ and $f^-$ individually, then by linearity we may conclude that it holds for $f$. We may therefore assume henceforth that $f$ is non-negative.
Denote the time averages $A_n(x) = \frac{1}{n}\sum_{i=1}^{n} f(T^{i-1}x)$. The limsup of a sequence always exists, so let us write $$\bar{A}(x) = \limsup_{n\to\infty} A_n(x).$$ At this point we don't know that $A_n(x)$ is bounded, so $\bar{A}(x)\in[0,\infty]$. We may even drop the argument and write $\bar{A}$ instead of $\bar{A}(x)$, since $\bar{A}(x)$ is constant almost everywhere.
Proof that $\bar{A}$ is constant almost everywhere.
Notice that $A_n(T(x)) = \frac{1}{n}\sum_{i=1}^{n} f(T^{i}x)$, so $A_n(T(x))$ and $A_n(x)$ differ by two terms, i.e. $$A_n(T(x)) - A_n(x) = \frac{f(T^{n}x)}{n} - \frac{f(x)}{n}.$$ As $n$ grows, $\frac{f(x)}{n}$ tends to zero. What is less obvious is that $\frac{f(T^{n}x)}{n}$ tends to zero.
Let $\varepsilon > 0$. Since $T$ is measure-preserving, $$\mu\left(\frac{f(T^{n}x)}{n}\ge\varepsilon\right) = \mu\left(f(x)\ge n\varepsilon\right),$$ and using a Markov-type argument, $$\mu\left(f(x)\ge n\varepsilon\right) \le \frac{1}{n\varepsilon}\int_X f(x)\,d\mu \xrightarrow[\;n\to\infty\;]{} 0.$$ In fact, $\sum_{n\ge 1}\mu(f\ge n\varepsilon)\le\frac{1}{\varepsilon}\int_X f\,d\mu < \infty$, so by Borel–Cantelli $\frac{f(T^{n}x)}{n}\to 0$ almost everywhere.
This proves that $\bar{A}(x)$ is $T$-invariant almost everywhere, and by the important lemma we see that $\bar{A}(x)$ is constant. $\blacksquare$
The first part of the proof is to establish why the limit $\lim_{n\to\infty} A_n(x)$ exists almost everywhere. We will show this by equating the limsup and the liminf of $A_n(x)$: $$\limsup_{n\to\infty} A_n(x) = \liminf_{n\to\infty} A_n(x).$$
For arbitrary $\alpha < \bar{A}$, we will prove that $\liminf_{n\to\infty} A_n(x)\ge\alpha$; this implies that $\liminf_{n\to\infty} A_n(x)\ge\limsup_{n\to\infty} A_n(x)$, and since the limit inferior is always less than or equal to the limit superior, it follows that $$\liminf_{n\to\infty} A_n(x) = \limsup_{n\to\infty} A_n(x),$$ so the pointwise limit exists.
In the second part of the proof we show that $\bar{A} = \int_X f\,d\mu$.
The condition that $\alpha < \bar{A}$ implies that for almost every $x$ there exists a minimal $\ell(x)$ such that $A_{\ell(x)}(x) > \alpha$.
Insight 1: long enough subsequences have an average $f$ value which exceeds $\alpha$.
What does this mean? It means that starting from (almost) any $x$ there exists $\ell(x)$ such that $$\frac{1}{\ell(x)}\left(f(x) + f(Tx) + \cdots + f(T^{\ell(x)-1}x)\right) > \alpha.$$

Insight 2: We can decompose the sequence $x, T(x), T^{2}(x), \dots, T^{n-1}(x)$ into subsequences, each of which has an average value greater than $\alpha$.
Let’s see what this means pictorially.
Each subsequence has an average value greater than $\alpha$.
The sequence of breakpoints is precisely defined as
$$\begin{aligned}
k_0 &= \ell(x)\\
k_1 &= k_0 + \ell(T^{k_0}x)\\
k_2 &= k_1 + \ell(T^{k_1}x)\\
&\;\;\vdots\\
k_i &= k_{i-1} + \ell(T^{k_{i-1}}x).
\end{aligned}$$
If we assume that $\ell(T^{k}x)\le L$ for every $k$, we can cover $x, T(x), T^{2}(x), \dots, T^{n-1}(x)$ by such subsequences, with the possible exception of the end ($k_i$ need not equal $n-1$; taking the largest $k_i\le n$, we have $n - L\le k_i\le n$). Since $f\ge 0$,
$$\begin{aligned}
A_n(x) &= \frac{f(x) + f(T(x)) + \cdots + f(T^{n-1}x)}{n}\\
&\ge \frac{f(x) + \cdots + f(T^{k_0-1}x)}{n} + \frac{f(T^{k_0}x) + \cdots + f(T^{k_1-1}x)}{n} + \cdots + \frac{f(T^{k_{i-1}}x) + \cdots + f(T^{k_i-1}x)}{n}\\
&= \frac{A_{k_0}(x)\cdot k_0 + A_{k_1-k_0}(T^{k_0}x)\cdot(k_1-k_0) + \cdots + A_{k_i-k_{i-1}}(T^{k_{i-1}}x)\cdot(k_i-k_{i-1})}{n}\\
&\ge \frac{\alpha\left(k_0 + (k_1-k_0) + \cdots + (k_i-k_{i-1})\right)}{n} = \frac{\alpha\, k_i}{n} \ge \alpha\,\frac{n-L}{n}.
\end{aligned}$$
As $n\to\infty$, $\liminf_{n\to\infty} A_n(x)\ge\alpha$. So the limit exists!
Fix $\varepsilon > 0$ and choose $L$ large enough that $\mu(\{x\mid\ell(x) > L\})\le\varepsilon$ (possible since $\ell(x)$ is finite for almost every $x$). We can now define a new function $\tilde{f}$ in the following way: $$\tilde{f}(x) = \begin{cases} f(x) & \ell(x)\le L\\ \alpha & \ell(x) > L\end{cases}\;\ge\; f(x)$$ (note that if $\ell(x) > L\ge 1$ then $A_1(x) = f(x)\le\alpha$, so indeed $\tilde{f}\ge f$). Similarly, we can define $\tilde{A}_n(x) = \frac{1}{n}\sum_{i=1}^{n}\tilde{f}(T^{i-1}x)$. Consider the function $$g(x) = \begin{cases} 0 & \ell(x)\le L\\ \alpha - f(x) & \ell(x) > L\end{cases}$$ with time average $B_n(x) = \frac{1}{n}\sum_{i=1}^{n} g(T^{i-1}x)$; then $\tilde{f}(x) = f(x) + g(x)$ and $\tilde{A}_n(x) = A_n(x) + B_n(x)$.
We can similarly define $\tilde{\ell}(x) = \min\{n\mid\tilde{A}_n(x)\ge\alpha\}$. By definition, $$\tilde{\ell}(x)\le\begin{cases}\ell(x) & \ell(x)\le L\\ 1 & \ell(x) > L,\end{cases}$$ so in particular $\tilde{\ell}(x)\le L$.
What have we done? We introduced a correction term $g(x)$ so that $\tilde{f}$ has bounded $\tilde{\ell}(x)$.
Insight 4: We can reduce the case where $\ell(x)$ is unbounded to the case where $\ell(x)$ is bounded.
Having reduced to the case that $\tilde{\ell}(x)\le L$, we can prove, as before, the inequality $$\tilde{A}_n(x)\ge\frac{n-L}{n}\,\alpha.$$ Taking the integral of both sides, $$\frac{n-L}{n}\,\alpha\le\int_X\tilde{A}_n(x)\,d\mu = \int_X\frac{1}{n}\sum_{i=1}^{n}\tilde{f}(T^{i-1}x)\,d\mu = \int_X\tilde{f}\,d\mu,$$ since $T$ is measure-preserving. Hence $$\int_X f\,d\mu + \int_X g\,d\mu = \int_X\tilde{f}\,d\mu\ge\frac{n-L}{n}\,\alpha.$$ Since $0\le g(x)\le\alpha$ and $g$ is supported on a set of measure at most $\varepsilon$, we get a lower bound for $\int_X f\,d\mu$: $$\int_X f\,d\mu\ge\int_X\tilde{f}\,d\mu - \varepsilon\alpha\ge\alpha\left(\frac{n-L}{n} - \varepsilon\right).$$ Indeed, this is true for any $n$, any $\varepsilon$, and any $\alpha < \bar{A}$. Taking $n\to\infty$, then $\varepsilon\to 0$ and $\alpha\to\bar{A}$, we get $$\bar{A}\le\int_X f\,d\mu.$$
This inequality holds for any $f\in L^1$! We're very close to proving the first part of the theorem. What remains to prove is that $\liminf_{n\to\infty} A_n(x)\ge\bar{A}$.
Proving that the limit inferior and superior are equal#
Notice that $g\in L^1$; as such, by the useful inequality above, $$\limsup_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n} g(T^{i-1}x)\le\int_X g\,d\mu.$$ Having shown that $\int_X g\,d\mu\le\alpha\varepsilon$, we get $\limsup_{n\to\infty} B_n\le\alpha\varepsilon$. By reducing to the warmup case (where $\tilde{\ell}$ is bounded by $L$), $\liminf\tilde{A}_n\ge\alpha$. The well-known inequality $$\liminf A_n + \limsup B_n\ge\liminf(A_n + B_n) = \liminf\tilde{A}_n$$ provides a hint for how to prove a lower bound for $\liminf A_n$. Combining both inequalities, we see that $$\liminf A_n\ge\liminf\tilde{A}_n - \limsup B_n\ge\alpha - \alpha\varepsilon = \alpha(1-\varepsilon).$$ Taking $\alpha\to\bar{A}$ and $\varepsilon\to 0$, $\liminf A_n\ge\bar{A}$, proving that the limit $\lim_{n\to\infty} A_n(x) = \bar{A}$ exists almost everywhere.
If $f$ is bounded, we can view $A_n(x)$ as a bounded sequence of functions which converges pointwise (a.e.) to $\bar{A}$, and by the bounded convergence theorem $$\int_X f\,d\mu = \int_X A_n\,d\mu\xrightarrow[\;n\to\infty\;]{}\int_X\bar{A}\,d\mu = \bar{A}.$$
We have already shown that for any $f\in L^1$, $\bar{A}\le\int_X f\,d\mu$. What remains to show is that $\bar{A}\ge\int_X f\,d\mu$.
Let $M\in\mathbb{R}$ and consider the bounded function $f\wedge M = \min(f(x), M)$. Then $$A_n(x)\ge\frac{1}{n}\sum_{i=1}^{n}(f\wedge M)(T^{i-1}x);$$ taking limits of both sides, $$\bar{A}\ge\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}(f\wedge M)(T^{i-1}x),$$ and reducing to the previous (bounded) case, $$\bar{A}\ge\int_X f\wedge M\,d\mu.$$ Since this is true for every $M$, taking $M\to\infty$, $$\bar{A}\ge\int_X f\,d\mu.\qquad\blacksquare$$
Gelfand's problem concerns the first digit of $a_n = k^n$, $n\in\mathbb{N}$: the first digit follows the distribution $$P(i) = \log_{10}\left(\frac{i+1}{i}\right)$$ whenever $k$ is not a power of $10$ (more precisely, whenever $\log_{10} k$ is irrational).
The distribution of first digits. Each bar represents a digit, and the height of the bar is the percentage of numbers that start with that digit (Credit: Wikipedia)
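Before the proof, a quick numerical check of the claim (my own sketch): tally the leading digits of $2^n$ for $n = 1,\dots,N$ using exact integer arithmetic and compare with $\log_{10}\!\left(\frac{i+1}{i}\right)$.

```python
import math
from collections import Counter

N = 2000
counts = Counter(int(str(2 ** n)[0]) for n in range(1, N + 1))  # exact leading digit of 2^n

for digit in range(1, 10):
    empirical = counts[digit] / N
    predicted = math.log10((digit + 1) / digit)
    print(digit, round(empirical, 4), round(predicted, 4))      # the two columns agree closely
```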
Before we prove this fact, we need a lemma about the ergodicity of irrational shifts.
Lemma: $T_\alpha(x) = x + \alpha\bmod 1$ is ergodic for $\alpha\notin\mathbb{Q}$ on the measure space $[0,1]$ with the uniform measure (Borel measure).#
We will prove that there are no $T_\alpha$-invariant functions (in $L^2$) which are not constant a.e. Let $f\in L^2([0,1])$; it has a representation as a Fourier series
$$f(x) = \sum_{k\in\mathbb{Z}}\hat{f}(k)\,e^{i 2\pi k x},\qquad\text{so}\qquad f\circ T(x) = \sum_{k\in\mathbb{Z}}\hat{f}(k)\,e^{i 2\pi k(x+\alpha)} = \sum_{k\in\mathbb{Z}}\hat{f}(k)\,e^{i 2\pi k\alpha}\,e^{i 2\pi k x}.$$ Setting both expressions equal, and since the Fourier basis is an orthogonal set, the corresponding coefficients must be equal: $$\hat{f}(k) = \hat{f}(k)\,e^{i 2\pi k\alpha}.$$ But since $\alpha$ is irrational, $e^{i 2\pi k\alpha}\ne 1$ for $k\ne 0$, so $\hat{f}(k) = 0$ for every $k\ne 0$, which means $f$ is constant a.e.
Suppose $E$ is a $T$-invariant set, so $T^{-1}(E) = E$, i.e. $x\in E\iff T(x)\in E$. Therefore $\mathbf{1}_E = \mathbf{1}_E\circ T$, which implies that $\mathbf{1}_E$ is constant almost everywhere, so either $\mu(E) = 0$ or $\mu(E) = 1$; hence the system is ergodic.
Advanced comment: $T_\alpha$ is actually *uniquely ergodic*, which means that there is only one probability measure $\mu$ which is $T_\alpha$-invariant. It can be shown that for uniquely ergodic transformations, $$\lim_{n\to\infty} A_n(x) = \int_X f(x)\,d\mu$$ holds *everywhere* (for continuous $f$).
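The practical content of unique ergodicity is that the time average of a continuous observable converges to the same value from *every* starting point, not just almost every one. A minimal sketch (my own, with $f(x) = \cos^2(2\pi x)$, whose integral is $1/2$):

```python
import math

alpha = math.sqrt(2) % 1.0                      # irrational rotation angle
T = lambda x: (x + alpha) % 1.0
f = lambda x: math.cos(2 * math.pi * x) ** 2    # continuous observable, integral = 1/2

def orbit_average(x, N=10**5):
    total = 0.0
    for _ in range(N):
        total += f(x)
        x = T(x)
    return total / N

for x0 in (0.0, 0.1, 0.5, 0.987654):
    print(x0, orbit_average(x0))                # every starting point gives ~0.5
```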
We write the decimal expansion of $x$ as $x_1 x_2\dots x_p$. Notice that $$x_1\cdot 10^{p-1}\le x < (x_1+1)\cdot 10^{p-1};$$ taking $\log_{10}$ of both sides, $$\log_{10}(x_1) + (p-1)\le\log_{10}(x) < \log_{10}(x_1+1) + (p-1),$$ and so $$\log_{10}(x_1)\le\{\log_{10}(x)\} < \log_{10}(x_1+1),$$ where $\{\log_{10}(x)\}$ is the fractional part of $\log_{10}(x)$. So for any number $x$ that starts with the digit $i$, $$\log_{10}(i)\le\{\log_{10}(x)\} < \log_{10}(i+1).$$
Consider the number $x = k^n$: $$\log_{10}(x_1)\le\{n\log_{10}(k)\} < \log_{10}(x_1+1).$$ Now consider the map $T(x) = x + \log_{10}k\bmod 1$. Since $\log_{10}k$ is irrational, this map is (uniquely) ergodic, and its invariant measure is the uniform measure on $[0,1]$.
Consider the function $f_i(x) = \mathbf{1}_{\log_{10}(i)\le x < \log_{10}(i+1)}$. By Birkhoff's theorem (in its version for uniquely ergodic maps), $$P(i) = \Pr\left(\text{first digit of } k^n \text{ is } i\right) = \lim_{n\to\infty}\frac{1}{n}\sum_{j=1}^{n} f_i(T^{j}0) = \int f_i\,d\mu = \mu\big[\log_{10}(i),\log_{10}(i+1)\big) = \log_{10}\left(\frac{i+1}{i}\right).\qquad\blacksquare$$
Consider the case where $f = \mathbf{1}_A$, the indicator function of some set $A$, formally defined as $$\mathbf{1}_A(x) = \begin{cases}1 & x\in A\\ 0 & x\notin A.\end{cases}$$ We see that $$\lim_{N\to\infty}\frac{\left|\{1\le i\le N\mid T^{i-1}(x)\in A\}\right|}{N} = \mu(A),$$ i.e. we observe the fraction of times the orbit of $x$ enters the set $A$ over time, and it converges to the measure of the set. This implies that for every set $A$ of non-zero measure and almost every configuration $x$ there exists some time $n$ for which $T^{n}(x)\in A$ (Poincaré recurrence theorem). The assumption that the system is ergodic can be weakened.
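A final sketch (my own illustration) of the indicator-function case: for an irrational rotation and $A = [0, 0.1)$, the fraction of times the orbit visits $A$ approaches $\mu(A) = 0.1$, and in particular the orbit keeps returning to $A$; the average gap between visits comes out to roughly $1/\mu(A)$.

```python
import math

alpha = math.sqrt(5) % 1.0
T = lambda x: (x + alpha) % 1.0
in_A = lambda x: x < 0.1                  # indicator of A = [0, 0.1), mu(A) = 0.1

x, visits, last_visit, gaps, N = 0.05, 0, 0, [], 10**6
for n in range(1, N + 1):
    x = T(x)
    if in_A(x):
        visits += 1
        gaps.append(n - last_visit)       # time since the previous visit (recurrence!)
        last_visit = n

print(visits / N)                         # ~0.1 = mu(A): fraction of time spent in A
print(sum(gaps) / len(gaps))              # average return time, ~1/mu(A) = 10
```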