Introduction

Suppose you want to model the dynamics of a gas inside a box. We model the gas as a bunch of molecules, which can hit each other and the walls of the box they are enclosed in. We consider the situation in which there is no loss of energy (collisions are elastic; there is no friction or heat dissipation).

Particles in a box

We can write physical equations of motion to model this gas. The problem is that the number of molecules is enormous, so we cannot describe the position and momentum of each particle. Nor can we solve these equations efficiently. What we can do is observe some macroscopic physical quantity of the system (such as temperature, heat capacity, etc.).

In mathematical terms, we have $N$ particles with positions $q_{1},q_{2},\dots,q_{N}$ (each a 3-dimensional vector) and momenta $p_{1},p_{2},\dots,p_{N}$ (also 3-dimensional vectors). We can collect all variables into a single variable $x=(q_{1},q_{2},\dots,q_{N},p_{1},p_{2},\dots,p_{N})$, the state of the system. What we are interested in measuring is $f(x)$. But since molecules move very quickly ($\approx 500\,\mathrm{m/s}$), the time it takes to perform a physical measurement is actually a long time for the molecules. That is, assuming that the system evolves at time $t$ as $P_{t}(x)$, we are actually measuring $$\frac{1}{\tau}\int_{0}^{\tau} f(P_{t}(x))\,dt,$$ where $\tau$ is the period of time it takes to perform the measurement. But as we assumed $\tau$ is large (since each molecule moves so quickly), it is safe to assume we are measuring $$\lim_{\tau \to \infty} \frac{1}{\tau}\int_{0}^{\tau} f(P_{t}(x))\,dt.$$ Suppose the system has a fixed energy $C$. The ergodic theorem states that $$\lim_{\tau \to \infty} \frac{1}{\tau}\int_{0}^{\tau} f(P_{t}(x))\,dt=\int_{\mathcal{X}}{f(x)\,d\mu},$$ where $\mathcal{X} = \lbrace x\mid E(x)=C\rbrace$ is a (compact) manifold of states with energy $C$ and $\mu$ is the uniform (Borel) measure over all states on the manifold.

Intuitively, this means that taking a time-average measurement of a physical quantity is equivalent to an average of the physical quantity over all possible states.

Formalizing even further

At this point we move from a physical understanding of the subject to a mathematical one: we are trying to capture the essential conditions under which the ergodic theorem is true, and to develop a useful framework that will allow us to generalize even further.

In general, we are given some space $\mathcal{X}$ (a set), with some transformation $T$ (remember the time-evolution operator $P_{t}$ we saw before). What might this transformation be in general?

Consider an ensemble of particles, each with position $x$ and momentum $p$. The area of the blue pixels (the phase-space distribution) is constant along the time-evolution of the system. This result is known as Liouville's theorem.

Liouville's theorem

Ah-ha! This transformation must preserve the volume of the phase space (the space of positions and momenta).

To capture this notion, we define a measure $\mu$ to be a function which takes sets as input and returns values in $[0,1]$. This function has some other properties, such as $\mu(\emptyset)=0$ and $\mu(\mathcal{X})=1$, and $\mu\left(\bigcup_{i}A_{i}\right)=\sum_{i}\mu(A_{i})$ for a countable collection of disjoint sets $\lbrace A_{1}, A_{2}, \dots \rbrace$.

The measure needs to be $T$-preserving in the sense that for every measurable set $E$ we have $\mu(E)=\mu(T^{-1}E)$. (A not-very-important aside: the reason we write $T^{-1}$ and not $T$ is technical: $T$ might not be invertible in general.)

A system is said to be ergodic if there are no stationary sets except sets of zero measure or complements of sets of zero measure, i.e. if $T^{-1}(E)=E$ then $\mu(E)=0$ or $\mu(\mathcal{X}\setminus E)=0$. In other words, a set can be stationary only if it is "almost everything" or "almost nothing".

We could imagine that the gas molecules are never confined to a specific part of the phase space; for example, smoke released in a room eventually fills the room. It is worth noting that for most physical systems we don't know whether they are ergodic, and we usually just assume that they are.
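To make the definition concrete, here is a small numerical sketch (the transformation, the set $E$, and the sample count are illustrative choices, not anything canonical). The circle rotation $T(x) = x + \tfrac12 \bmod 1$ is measure-preserving but not ergodic: the set $E = [0, \tfrac14) \cup [\tfrac12, \tfrac34)$ is stationary yet has measure $\tfrac12$, so it is neither "almost nothing" nor "almost everything".

```python
import random

# A toy check of (non-)ergodicity for the rotation T(x) = x + 1/2 mod 1.
# The set E = [0, 1/4) U [1/2, 3/4) satisfies T^{-1}(E) = E yet has
# measure 1/2, witnessing that this rational rotation is NOT ergodic.

def T(x):
    return (x + 0.5) % 1.0

def in_E(x):
    return (0.0 <= x < 0.25) or (0.5 <= x < 0.75)

random.seed(0)
samples = [random.random() for _ in range(10_000)]

# Invariance: x is in E exactly when T(x) is in E, for every sample.
invariant = all(in_E(x) == in_E(T(x)) for x in samples)
# Monte Carlo estimate of mu(E).
measure_E = sum(in_E(x) for x in samples) / len(samples)

print(invariant)                       # True: E is a stationary set
print(abs(measure_E - 0.5) < 0.02)     # True: mu(E) is about 1/2
```

An irrational rotation admits no such set, as is proven later in this post.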

Proof of the Mean-Ergodic Theorem (von Neumann '31)

We will start by proving the mean ergodic theorem, which is an easier variant to prove than the one stated earlier.

For the proof of the mean ergodic theorem we only consider functions which are in $L^2$; this means that $\int_{\mathcal{X}}|f(x)|^{2}\,d\mu < \infty$. These functions form a Hilbert space, which is useful because we can apply some of our intuitions from linear algebra to these sorts of spaces, such as the notion of an inner product $\langle f,g \rangle = \int_{\mathcal{X}}f(x)g(x)\,d\mu$. We will prove a discrete version of the mean ergodic theorem, namely that $$\frac{1}{N}\sum_{i=1}^{N} f\circ T^{i-1} \overset{L^2}{\longrightarrow}\int_{\mathcal{X}}{f \,d\mu}.$$ It is important to note that the convergence is not pointwise but in $L^2$ (which is a less strict requirement).

An important lemma - invariant functions are constant (almost everywhere)!

In ergodic systems, $T$-invariant functions (a.e.) are constant (a.e.). Let $f$ be a function such that $f=f\circ T$ (a.e.). Consider the set $[f>t]$, that is, $\lbrace x \mid f(x) > t \rbrace$. The preimage of $[f>t]$ under $T$ is $$T^{-1}[f>t] = \lbrace x \mid f\circ T(x) > t \rbrace = \lbrace x \mid f(x) > t \rbrace = [f>t].$$

By ergodicity, $\mu([f>t])=0$ or $\mu([f>t])=1$. Since $\mu([f>t])$ is monotonically decreasing in $t$, there must be some critical value $t^\ast$ for which $\mu([f>t^\ast])=0$ but $\mu([f>t])=1$ for every $t < t^\ast$, so $f$ is constant and equal to $t^\ast$ (up to a set of zero measure).

Figure

Invariant functions

A trivial case to consider is when $f=f\circ T$, that is, $f$ is $T$-invariant; we know in this case that $f$ is constant. It is then obvious that $$\lim_{N\to \infty} \frac{1}{N}\sum_{i=1}^{N} f(T^{i-1}x)=f(x)=\int f(x)\,d\mu.$$

Insight 1: the time-average of a $T$-invariant function $f$ converges to the function $f$.

Key intuition

We could think of the invariant functions as the kernel of $\mathrm{id} - U$, where $U(f) = f\circ T$ is an operator acting on functions. It seems natural to consider the image of $\mathrm{id} - U$, that is, functions of the form $g-g\circ T$ for $g\in L^2$; we will call this subspace $\mathcal{C}$ (its elements are also called coboundary functions). The time average of the function $g-g\circ T$ is a telescoping sum $$\frac{1}{N}\sum_{i=1}^{N}\left(g(T^{i-1}x)-g(T^{i}x)\right)=\frac{1}{N}(g(x)-g(T^{N}x)),$$ so $$\frac{1}{N}\left\Vert g-g\circ T^{N} \right\Vert_2 \leq \frac{2}{N}\left\Vert g\right\Vert_2 \overset{N\to \infty}{\longrightarrow} 0.$$

So the time average of $g-g\circ T$ converges in $L^2$ to $0$. This makes sense, since measurements of observed physical quantities don't change drastically after infinitesimal time steps.
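A quick numerical sanity check of the telescoping bound (a sketch; the rotation $T$ and the choice $g(x)=\sin(2\pi x)$ are illustrative assumptions, since any measure-preserving $T$ and any bounded $g$ would do):

```python
import math

# Numerical sketch of the telescoping-sum bound: for a coboundary g - g∘T,
# the time average collapses to (g(x) - g(T^N x)) / N, which is O(1/N).
# Here T is the circle rotation by an irrational angle and g(x) = sin(2*pi*x),
# both illustrative choices.

alpha = math.sqrt(2) % 1.0
T = lambda x: (x + alpha) % 1.0
g = lambda x: math.sin(2 * math.pi * x)

def coboundary_average(x, N):
    """Time average of (g - g∘T) along the orbit of x, over N steps."""
    total, y = 0.0, x
    for _ in range(N):
        total += g(y) - g(T(y))
        y = T(y)
    return total / N

x = 0.1
for N in (10, 100, 1000):
    avg = coboundary_average(x, N)
    # Telescoping identity: avg == (g(x) - g(T^N x)) / N, and |g| <= 1.
    assert abs(avg) <= 2.0 / N + 1e-12
    print(N, avg)
```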

Insight 2: functions in $L^2$ can be decomposed into the sum of an invariant function and a coboundary function.

Technical comment (not important…): it is easy to show that this also holds for functions in the closure $\overline{\mathcal{C}}$. We now want to show that $L^2$ can be decomposed as $\lbrace\text{invariant functions}\rbrace +\overline{\mathcal{C}}$. The reason we care about the closure is that $\mathcal{C}$ is not necessarily closed.

Technical part of the proof

Obviously, a function which is both invariant and in $\overline{\mathcal{C}}$ must be the zero function, so the two subspaces intersect only at zero.

Assume that there exists a non-zero function $f$ which is orthogonal to $\lbrace\text{invariant functions}\rbrace + \overline{\mathcal{C}}$. In particular, $f$ is orthogonal to $f-f\circ T$. Expanding, $$\left \Vert f-f\circ T\right \Vert_{2}^{2} = \left \Vert f \right \Vert_{2}^{2} + \left \Vert f \circ T \right \Vert_{2}^{2} - 2\langle f, f\circ T \rangle.$$ Since $T$ is measure preserving, it follows (by a standard argument with simple functions) that $\left \Vert f \right \Vert_{2}^{2} = \left \Vert f \circ T \right \Vert_{2}^{2}$. Plugging back in, we see that $$\left \Vert f-f\circ T\right \Vert_{2}^{2} = 2\langle f, f - f\circ T \rangle = 0.$$ Thus $f=f\circ T$; by definition this means that $f$ is invariant, contrary to our assumption that $f$ is orthogonal to $\lbrace\text{invariant functions}\rbrace$.

This means that any function $f$ in $L^2$ can be written as a sum of two functions: one which is invariant and remains constant after computing its time average, and another whose time average converges in $L^2$ norm to $0$. Thus, the time-average of $f$ converges to an invariant function, say $\overline{f}$.

As we saw previously in the lemma, this function must be constant. Which constant?

Proving that the time-average and phase-average are equal

Explanation 1

Since $\overline{f}$ is constant a.e., $\overline{f} = \int_{\mathcal{X}}\overline{f}\,d\mu$. An alternative way to compute this integral is as the inner product $\langle \overline{f}, 1\rangle$. Combining this with the fact that $\frac{1}{N}\sum_{i=1}^{N}f\circ T^{i-1}\longrightarrow \overline{f}$, we see that $$\overline{f} = \left \langle \lim_{N\to \infty}\frac{1}{N}\sum_{i=1}^{N}f\circ T^{i-1}, 1\right \rangle,$$ and by linearity (and continuity) of the inner product $$\overline{f} = \lim_{N\to \infty} \frac{1}{N}\sum_{i=1}^{N}\left \langle f\circ T^{i-1}, 1\right \rangle.$$ Using the fact that $T$ is measure preserving, each term equals $\int_{\mathcal{X}}f\,d\mu$, and we get our desired result: $$\overline{f} = \lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N} \int_{\mathcal{X}} f(T^{i-1}x)\, d\mu = \int_{\mathcal{X}}{f\,d\mu}. \qquad\blacksquare$$

Explanation 2

Another intuition is to think of $\overline{f}$ as the projection of $f$ onto the space of invariant functions (constant almost everywhere): $$\overline{f} = \arg\min_{\alpha \in \mathbb{R}}{\left \Vert f-\alpha \right \Vert_{2}},$$ and the constant which minimizes the squared $L^2$ distance to $f$ is its mean, $\int_{\mathcal{X}} f \,d\mu$.

Pointwise Ergodic Theorem (Birkhoff '32)

One year after von Neumann proved the mean ergodic theorem, Birkhoff discovered a similar result. It states that for a function in $L^1$ (i.e. $\int_{\mathcal{X}}|f(x)|\,d\mu < \infty$) $$\lim_{N\to \infty} \frac{1}{N}\sum_{i=1}^{N} f(T^{i-1}x)=\int_{\mathcal{X}} f(x)\,d\mu$$ almost everywhere. Notice that the convergence here is pointwise.

In Kolmogorov's interpretation of probability, random variables are essentially functions $f_{i}:\mathcal{X}\to \mathbb{R}$. The law of large numbers states that for a collection of independent, identically distributed variables $f_{1}, f_{2}, \dots, f_{N}$ (with the same distribution as $f$), their mean converges to the expectation: $$\frac{f_{1}+f_{2}+\dots+f_{N}}{N}\longrightarrow \mathbb{E}[f].$$ We can view ergodicity as a condition for generating "random" behaviour. Birkhoff's theorem gives us another estimate for the expectation of $f$: evaluating $\mathbb{E}[f]=\int_{\mathcal{X}}f(x)\,d\mu$ is equivalent to picking $x\in \mathcal{X}$ arbitrarily and computing $$\frac{f(x)+f(T(x))+\dots+f(T^{N-1}(x))}{N}.$$ Notice how this expression is totally deterministic, whereas sampling i.i.d. variables requires us to assume we have some source of randomness.

In some sense, ergodic systems give rise to “random” behaviour.
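To see this deterministic averaging in action, here is a minimal sketch (the observable $f(x)=x^2$ and the starting point are illustrative choices). It uses the irrational circle rotation, which is shown to be ergodic later in this post; the space average of $f$ over $[0,1]$ is $1/3$.

```python
import math

# Birkhoff's theorem as a deterministic "law of large numbers": for the
# ergodic rotation T(x) = x + alpha mod 1 (alpha irrational), the orbit
# average of f approaches the space average over [0, 1], with no randomness.
# f(x) = x^2 is an illustrative choice; its integral over [0, 1] is 1/3.

alpha = (math.sqrt(5) - 1) / 2          # golden-ratio rotation, irrational
f = lambda x: x * x

def time_average(x0, N):
    """Average of f along the first N points of the orbit of x0."""
    total, x = 0.0, x0
    for _ in range(N):
        total += f(x)
        x = (x + alpha) % 1.0
    return total / N

space_average = 1.0 / 3.0
est = time_average(0.0, 100_000)
print(abs(est - space_average) < 1e-2)   # True: orbit average is close to 1/3
```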

Proof of the Pointwise Ergodic theorem

This proof is somewhat more complicated than the previous one; I hope the illustration helps explain the underlying construction.

For every $x\in \mathcal{X}$, consider the sequence $A_{n}(x) = \frac{1}{n}\sum_{i=1}^{n}{f(T^{i-1}x)}$. Our goal is to prove that almost everywhere $$\lim_{n\to \infty}A_{n}(x) = \int_{\mathcal{X}}f(x)\,d\mu.$$

Some preliminary preparations

Any function $f \in L^1$ may be decomposed as $f = f_{+} - f_{-}$, where $f_+$ and $f_-$ are each non-negative. If we prove that the theorem holds for $f_{+}$ and $f_{-}$ individually, then by linearity it holds for $f$. We may therefore assume henceforth that $f$ is non-negative.

The limsup of a sequence always exists; let's denote $\bar{A}(x)=\limsup_{n\to \infty} A_{n}(x)$. At this moment we don't know that $A_{n}(x)$ is bounded, so $\bar{A}(x) \in [0, \infty]$. We may even drop the argument and write $\bar{A}$ instead of $\bar{A}(x)$, since (as we show next) $\bar{A}(x)$ is constant almost everywhere.

Proof that $\bar{A}$ is constant almost everywhere.

Notice that $A_{n}(T(x)) = \frac{1}{n}\sum_{i=1}^{n}{f(T^i x)}$, so $A_{n}(T(x))$ and $A_{n}(x)$ differ by two terms, i.e. $$A_{n}(T(x)) - A_{n}(x) = \frac{f(T^n x)}{n} - \frac{f(x)}{n}.$$ As $n$ grows, $\frac{f(x)}{n}$ tends to zero. What is less obvious is that $\frac{f(T^n x)}{n}$ tends to zero.

Let $\varepsilon > 0$. Since $T$ is measure-preserving, $$\mu\left(\frac{f(T^{n}x)}{n} \geq \varepsilon\right) = \mu(f(x) \geq n\varepsilon),$$ and using a Markov-type argument, $$\mu\left(f(T^{n}x) \geq n\varepsilon\right) \leq \frac{\int_{\mathcal{X}}{f(x)\,d\mu}}{n \varepsilon}\overset{n\to \infty}{\longrightarrow} 0.$$

This proves that $\bar{A}(x)$ is $T$-invariant almost everywhere, and by the important lemma we see that $\bar{A}(x)$ is constant. $\blacksquare$

The proof outline

The proof consists of two parts.

The first part of the proof is to establish why the limit $\lim_{n\to \infty}A_{n}(x)$ exists almost everywhere. We will show this by equating the limsup and liminf of $A_{n}(x)$: $$\limsup_{n\to \infty} A_{n}(x)=\liminf_{n\to \infty} A_{n}(x).$$

For an arbitrary $\alpha < \bar{A}$, we will prove that $\liminf_{n\to \infty} A_{n}(x) \geq \alpha$. Since $\alpha$ can be taken arbitrarily close to $\bar{A}$, this implies that $\liminf_{n\to \infty} A_{n}(x) \geq \limsup_{n \to \infty} A_{n}(x)$. As the limit inferior is always less than or equal to the limit superior, this would give $\liminf_{n\to \infty} A_{n}(x) = \limsup_{n\to \infty} A_{n}(x)$, so the pointwise limit exists.

In the second part of the proof we show that $\bar{A}=\int_{\mathcal{X}}f\,d\mu$.

Part 1

Warmup

The condition $\alpha < \bar{A}$ implies that for almost every $x$ there exists a minimal $\ell(x)$ such that $A_{\ell(x)}(x) > \alpha$.

Insight 1: long enough subsequences have an average $f$ value which exceeds $\alpha$. What does this mean? It means that starting from any $x$ there exists $\ell(x)$ such that $$\frac{1}{\ell(x)}\left(f(x) + f(Tx)+\dots + f(T^{\ell(x) - 1} x)\right) > \alpha.$$ Insight 2: we can decompose the sequence $x, T(x), T^2(x), \dots, T^{n-1}(x)$ into subsequences, each with an average value greater than $\alpha$. Let's see what this means pictorially.

Figure

Each subsequence has average value greater than $\alpha$. The sequence of cut points is precisely defined as $$\begin{aligned}k_{0} &= \ell(x) \\ k_{1} &= k_{0} + \ell(T^{k_0}x) \\ k_{2} &= k_{1} + \ell(T^{k_{1}}x) \\ &\vdots \\ k_{i} &= k_{i-1} + \ell(T^{k_{i-1}}x)\end{aligned}$$

If we assume that $\ell(T^k x) \leq L$ for every $k$, we can cover $x, T(x), T^2(x), \dots, T^{n-1}(x)$ with such blocks, except possibly near the end point ($k_i$ might not equal $n-1$, but we can choose $i$ so that $n-L \leq k_{i} \leq n$). Since $f$ is non-negative we may drop the leftover terms, and so $$\begin{aligned}A_{n}(x) &= \frac{f(x) + f(T(x)) + \dots + f(T^{n-1} x)}{n} \\ &\geq \frac{f(x) + \dots + f(T^{k_{0} - 1}x)}{n} + \frac{f(T^{k_{0}}x) + \dots + f(T^{k_{1} - 1}x)}{n} + \dots + \frac{f(T^{k_{i-1}}x) + \dots + f(T^{k_{i} - 1}x)}{n}\\ &=\frac{A_{k_0}(x)\cdot k_{0}}{n} + \frac{A_{k_1 - k_0}(T^{k_{0}} x) \cdot (k_{1}-k_{0})}{n} + \dots + \frac{A_{k_{i}-k_{i-1}}(T^{k_{i-1}} x) \cdot (k_{i}-k_{i-1})}{n}\\ &\geq \frac{\alpha k_{i}}{n} \\ &\geq \alpha\frac{n-L}{n}. \end{aligned}$$ As $n \to \infty$, $$\liminf_{n\to \infty} A_{n}(x) \geq \alpha.$$ So the limit exists!

Back to the general case

How do we eliminate the assumption that $\ell(T^k x) \leq L$ for every $k$?

Insight 3: we could assume that $\ell(x) \leq L$ almost everywhere.

Why can we assume that?

Let $\varepsilon>0$ be a very small constant. Since $\ell(x)$ is finite almost everywhere, there exists some $L$ for which $\mu\lbrace \ell(x) > L \rbrace < \varepsilon$.

Defining the bounded function $\tilde{f}$

We can now define a new function $\tilde{f}$ in the following way: $$\tilde{f}(x) = \begin{cases}f(x) & \ell(x)\leq L\\ \alpha & \ell(x) > L\end{cases} \geq f(x)$$ (when $\ell(x) > L \geq 1$ we have $f(x) = A_1(x) \leq \alpha$, so indeed $\tilde{f} \geq f$). Similarly, we can define $\tilde A_n(x) = \frac{1}{n}\sum_{i=1}^{n}\tilde f(T^{i-1}x)$. Consider the function $$g(x) = \begin{cases}0 & \ell(x)\leq L\\ \alpha - f(x) & \ell(x) > L\end{cases}$$ with time average $B_{n}(x) = \frac{1}{n}\sum_{i=1}^{n}g(T^{i-1}x)$; then $\tilde{f}(x) = f(x) + g(x)$ and $\tilde A_{n}(x) = A_{n}(x) + B_{n}(x)$.

We can similarly define $\tilde{\ell}(x) = \min\lbrace n \mid \tilde A_{n}(x) \geq \alpha \rbrace$. By definition, $$\tilde{\ell}(x) = \begin{cases}\ell(x) & \ell(x)\leq L\\ 1 & \ell(x) > L\end{cases}$$

What have we done? We introduced a correction term $g(x)$ so that $\tilde{f}$ has bounded $\tilde{\ell}(x)$.

Insight 4: we can reduce the case where $\ell(x)$ is unbounded to the case where $\ell(x)$ is bounded.

A useful inequality

Having reduced to the case $\tilde{\ell}(x) \leq L$, we can prove as before the inequality $$\tilde A_{n}(x) \geq \frac{n-L}{n}\alpha.$$ Taking the integral on both sides, $$\frac{n-L}{n}\alpha \leq \int_{\mathcal{X}}\tilde A_{n}(x)\,d\mu = \int_{\mathcal{X}}\frac{1}{n}\sum_{i=1}^{n}\tilde f(T^{i-1}x)\,d\mu = \int_{\mathcal{X}}\tilde f\,d\mu,$$ since $T$ is measure preserving. Hence $$\int_{\mathcal{X}}f\,d\mu+\int_{\mathcal{X}}g \,d\mu=\int_{\mathcal{X}}\tilde{f}\,d\mu \geq \frac{n-L}{n}\alpha.$$ Since $0 \leq g(x) \leq \alpha$ on a set of measure at most $\varepsilon$ (and $g=0$ elsewhere), $\int_{\mathcal{X}}g\,d\mu \leq \alpha\varepsilon$, and we get a lower bound for $\int_{\mathcal{X}}f \,d\mu$: $$\int_{\mathcal{X}}f\,d\mu \geq \int_{\mathcal{X}}\tilde{f}\,d\mu - \varepsilon \alpha \geq \alpha\left(\frac{n-L}{n}-\varepsilon\right).$$ Indeed this is true for any $\varepsilon$, any $\alpha < \bar{A}$, and any $n$. Taking $\varepsilon \to 0$, $\alpha \to \bar{A}$ and $n\to \infty$, $$\boxed{ \bar{A} \leq \int_{\mathcal{X}}f\,d\mu}.$$

This inequality holds for any $f\in L^1$! We're very close to proving the first part of the theorem. What remains is to prove that $$\liminf_{n\to \infty} A_n(x) \geq \bar{A}.$$

Proving that the limit inferior and superior are equal

Notice that $g \in L^1$, so by the useful inequality (applied to $g$), $$\limsup_{n\to \infty} B_n = \limsup_{n\to \infty} \frac{1}{n}\sum_{i=1}^{n}g(T^{i-1}x) \leq \int_{\mathcal{X}}g\,d\mu \leq \alpha\varepsilon.$$ By reducing to the warmup case (recall $\tilde{\ell}(x) \leq L$), $$\liminf \tilde A_n \geq \alpha.$$ The well-known inequality $$\liminf A_{n} + \limsup B_{n} \geq \liminf(A_n + B_n) = \liminf\tilde A_n$$ provides us a hint for how to prove a lower bound for $\liminf A_{n}$. Combining both inequalities, we see that $$\liminf A_{n} \geq \liminf\tilde A_n - \limsup B_{n} \geq \alpha - \alpha\varepsilon = \alpha(1-\varepsilon).$$ Taking $\alpha \to \bar{A}$ and $\varepsilon \to 0$, $$\liminf A_{n} \geq \bar{A},$$ proving that the limit $\lim_{n\to \infty} A_{n}(x) = \bar{A}$ exists almost everywhere.

Part 2

The bounded case

We can view $A_{n}(x)$ as a bounded sequence of functions which converges pointwise (a.e.) to $\bar{A}$, and by the bounded convergence theorem, $$\int_{\mathcal{X}}f\,d\mu = \int_{\mathcal{X}}A_n \,d\mu \overset{n\to \infty}{\longrightarrow} \int_{\mathcal{X}}\bar{A} \,d\mu = \bar{A}.$$

The unbounded case

We have already shown that for any $f\in L^1$, $\bar{A} \leq \int_{\mathcal{X}}f\,d\mu$. What remains to show is that $\bar{A} \geq \int_{\mathcal{X}}f\,d\mu$.

Let $M\in \mathbb{R}$ and consider the bounded function $(f\wedge M)(x) = \min(f(x), M)$. Then $$A_{n}(x) \geq \frac{1}{n}\sum_{i=1}^{n}(f\wedge M)(T^{i-1}x).$$ Taking limits of both sides, $$\bar{A} \geq \lim_{n\to \infty}\frac{1}{n}\sum_{i=1}^{n}(f\wedge M)(T^{i-1}x),$$ and reducing to the previous (bounded) case, $$\bar{A} \geq \int_{\mathcal{X}}(f\wedge M) \,d\mu.$$ Since this is true for every $M$, taking $M\to \infty$, $$\boxed{\bar{A} \geq \int_{\mathcal{X}}f \,d\mu} \qquad \blacksquare$$

Applications of Birkhoff’s Theorem

Gelfand’s problem

Gelfand's problem states that the first digit of $a_n = k^{n}$, $n\in \mathbb N$, follows the distribution $$P(i) = \log_{10}\left(\frac{i+1}{i}\right)$$ for $k\neq 10^{m}$.

The distribution of first digits. Each bar represents a digit, and the height of the bar is the percentage of numbers that start with that digit (Credit: Wikipedia)
Benford's distribution

Before we prove this fact, we need a lemma about the ergodicity of irrational shifts.

Lemma: $T_\alpha(x) = x + \alpha \mod 1$ is ergodic for $\alpha \notin \mathbb{Q}$ on the measure space $[0,1]$ with the uniform (Borel) measure.

We will prove that there are no $T_\alpha$-invariant functions (in $L^2$) which are not constant a.e. Let $f \in L^2([0,1])$. It has a Fourier series representation $$f(x) = \sum_{k\in \mathbb{Z}}{\hat{f}(k)e^{i2\pi k x}},$$ so $$f\circ T_\alpha(x) = \sum_{k\in \mathbb{Z}}{\hat{f}(k)e^{i2\pi k \alpha}e^{i2\pi k x}}.$$ Setting both expressions equal, since the Fourier basis is an orthogonal set, the coefficients must match: $$\hat{f}(k) = \hat{f}(k)e^{i2\pi k\alpha}.$$ But since $\alpha$ is irrational, $e^{i2\pi k\alpha} \neq 1$ for $k\neq 0$, so $\hat{f}(k)=0$ for all $k\neq 0$, which means $f$ is constant a.e.

Suppose $E$ is a $T_\alpha$-invariant set, so $T_\alpha^{-1}(E)=E$ and $x\in E \iff T_\alpha(x) \in E$. Therefore $1_{E} = 1_{E}\circ T_\alpha$, which implies that $1_{E}$ is constant almost everywhere, so either $\mu(E)=0$ or $\mu(E)=1$; hence the system is ergodic.
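The mechanism behind the lemma can also be checked numerically: for $k \neq 0$ the averages $\frac{1}{N}\sum_{n<N} e^{i2\pi k n\alpha}$ are geometric sums with ratio $e^{i2\pi k\alpha} \neq 1$, hence $O(1/N)$, which is exactly why every nonzero Fourier coefficient of an invariant function must vanish. A small sketch (the choice $\alpha=\sqrt{2}$ and the values of $k$ are arbitrary):

```python
import cmath
import math

# For irrational alpha and k != 0, (1/N) * sum_{n<N} e^{2*pi*i*k*n*alpha}
# is a geometric sum with ratio e^{2*pi*i*k*alpha} != 1, so its magnitude
# is at most 2 / (N * |1 - ratio|), i.e. O(1/N).

def exponential_average(alpha, k, N):
    """Average of e^{2*pi*i*k*n*alpha} over n = 0, ..., N-1."""
    ratio = cmath.exp(2j * math.pi * k * alpha)
    s, z = 0j, 1 + 0j
    for _ in range(N):
        s += z
        z *= ratio
    return s / N

alpha = math.sqrt(2)
for k in (1, 2, 5):
    avg = exponential_average(alpha, k, 100_000)
    print(k, abs(avg) < 1e-4)   # True for each k: the average is tiny
```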

Advanced comment: $T_\alpha$ is actually *uniquely ergodic*, which means that there exists only one measure $\mu$ for which $T_\alpha$ is invariant. It can be shown that for uniquely ergodic transformations $$\lim_{n\to \infty}A_{n}(x) = \int_{\mathcal{X}}f(x)\,d\mu$$ holds *everywhere*.

We don’t present a proof for this fact.

Proof of Gelfand’s Problem

We write the decimal expansion of $x$ as $x_1 \dots x_p$. Notice that $$x_1\cdot 10^{p - 1} \leq x < (x_1 + 1) \cdot 10^{p-1}.$$ Taking $\log_{10}$ of both sides, $$\log_{10}(x_1) + (p-1) \leq \log_{10}(x) < \log_{10}(x_1 + 1) + (p-1),$$ and so $$\log_{10}(x_1) \leq \lbrace \log_{10}(x) \rbrace < \log_{10}(x_1 + 1),$$ where $\lbrace \log_{10}(x) \rbrace$ is the fractional part of $\log_{10}(x)$ (note that $\log_{10}(x_1 + 1) \leq 1$). So for any number $x$ that starts with the digit $i$, $$\log_{10}(i) \leq \lbrace \log_{10}(x) \rbrace < \log_{10}(i + 1).$$

Consider the number $x = k^n$: $$\log_{10}(x_1) \leq \lbrace n \log_{10}(k) \rbrace < \log_{10}(x_1 + 1).$$ Consider the action $T(x) = x + \log_{10}k \mod 1$. For $k \neq 10^m$, $\log_{10}k$ is irrational, and so this action is uniquely ergodic with invariant measure which is uniform on $[0,1]$.

Consider the function $f_i(x) = 1_{\log_{10}(i) \leq x < \log_{10}(i + 1)}$. By Birkhoff's theorem (for uniquely ergodic maps), $$P(i) = \Pr(\text{first digit of }k^n\text{ is }i) = \lim_{n\to \infty}\frac{1}{n}\sum_{j=1}^{n}f_{i}(T^{j}0) = \int f_i \,d\mu = \mu [\log_{10}(i), \log_{10}(i+1)) = \log_{10}\left(\frac{i+1}{i}\right). \qquad\blacksquare$$
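We can check the result empirically using the same reduction as in the proof: the first digit of $k^n$ is $\lfloor 10^{\lbrace n\log_{10}k\rbrace}\rfloor$. A sketch for $k=2$ (the value of $k$ and the number of samples are illustrative choices):

```python
import math
from collections import Counter

# Empirical check of Gelfand's problem, via the reduction from the proof:
# the first digit of k^n is floor(10 ** frac(n * log10(k))). Here k = 2.

N = 100_000
log_k = math.log10(2)

def first_digit(n):
    """First decimal digit of 2**n, computed from the fractional part."""
    frac = (n * log_k) % 1.0
    return int(10 ** frac)           # a value in 1..9

counts = Counter(first_digit(n) for n in range(1, N + 1))

for i in range(1, 10):
    empirical = counts[i] / N
    benford = math.log10((i + 1) / i)
    assert abs(empirical - benford) < 0.01
print("consistent with Benford's law")
```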

Poincaré Recurrence Theorem

Consider the case when f=1Af=1_{A} the indicator function for some set AA, formally defined as 1A={1xA0xA1_{A}=\begin{cases}1 & x\in A \\ 0 & x\notin A\end{cases} we see that limN{1iNTi1(x)A}N=μ(A)\lim_{N\to \infty} \frac{|\lbrace 1\leq i \leq N \mid T^{i-1}(x) \in A \rbrace|}{N}=\mu(A) i.e. we observe the fraction of times xx is enters the set AA over time and we see that it converges to the measure of the set. This implies that for every set AA of non-zero measure and some configuration xx there exists some time nn for which Tn(x)AT^{n}(x)\in A (Poincaré recurrence theorem). The assumption about the system being ergodic can be weakened.