Optimizing Individual-Level Models for Group-Level Predictions.
Part 2 — The Median of a Sum of Independent Identically Distributed

Introduction

In Part 1 — An Analysis of Bias I explored bias in group-level predictions coming from individual-level models. There I used the following theorem by Peter Hall (“On the Limiting Behaviour of the Mode and Median of a Sum of Independent Variables”¹):

But why is that true? I found Hall’s paper to be somewhat difficult to penetrate, but very interesting. As a service to the community, in this post I will introduce the tools used in proving this result in as motivated and detailed a fashion as I can manage, and supply a variant of Hall’s proof in full. I’ve changed some of the steps in the proof to make the proof more natural, removing his reliance on a theorem of Esseen, and instead proving his result using a mathematical tool called Edgeworth expansions (which I introduce) directly. I have also filled in some of the crucial details missing from the original proof. The basic outline of the proof is very appealing: use an estimate of the error of the Central Limit Theorem to prove the result.

Edgeworth Expansions — Estimating the error of the Central Limit Theorem

The easiest way to prove the Central Limit Theorem (CLT) and many of its variants is by looking at the “characteristic function” of random variables. The characteristic function of a random variable X is defined as

The reason that characteristic functions are helpful is three-fold:

  1. The characteristic function of a sum of independent variables is the product of their respective characteristic functions:

    so that makes computations easy.

  1. Knowing the characteristic function of a random variable determines its distribution. In fact, the characteristic function of X is nothing other than the Fourier transform of the probability density function of X (if one exists), and so this is just Fourier duality — the reverse Fourier transform of the Fourier transform is the original function! (Caveat: Fourier duality suffers from a plurality of conventions due to the identification of the group of real numbers with its dual group being “unnatural”, in the Category Theory sense of that word. We will follow the convention that the Fourier transform of a function f(t) is

    and the reverse transform of a function f(t) is

    With this convention, the reverse transform of the characteristic function of is the pdf of X, if one exists.)

  1. Finally,

    this is Levy’s Continuity Theorem.

The standard proof of the CLT follows exactly this approach. By normalizing first, you can reduce CLT to:

One proves this by showing that

point-wise.

The crucial step is that

and therefore:

, where the last step is because the X_i’s are all identically distributed. To simplify notation let X:=X_1.

Showing that

pointwise is relatively easy, though the details are irrelevant to the rest of the discussion.

The idea behind Edgeworth expansions, which is the mathematical tool that Hall uses as motivation (albeit just stops short of using directly), is similar:

To be more specific, what one does is write a Maclaurin expansion of Log of the characteristic function of X, where Log is the principal branch of log, as

For those of you who like naming things, these κ_j’s are usually called the “cumulants” of X. Generally speaking, cumulants don’t have an intuitive interpretation, but the first three do: κ_1 = E(X)κ_2= V(X), and κ_3 = E((X — E(X))³). In particular, for us: κ_1 = 0 and κ_2 = 1.

Thus the chosen expansion of the characteristic function of is then:

Plug that into

and get

This last equality is simply using the Maclaurin expansion of ; each r_j is a polynomial with real coefficients of degree 3j, and can be computed if need be.

It now just remains to apply a reverse Fourier transform to get an expansion of the probability density of S_n. This is relatively easy — there’s a trick:

(Why? The degree 0 case is standard, and the monomial case follows by repeatedly deriving the degree 0 case; the general case follows the monomial case trivially.) The trick continues: deriving the probability density function of the standard normal distribution j times is actually easy to compute (by commuting the derivative and the integral), and for any j it’s always some polynomial times the pdf of the standard normal. (The polynomials arising in this way are called Hermite polynomials, but it is not crucial to determine their properties for our purposes.)

The bottom line of all of this is that:

So we just got an estimate of the error of the Central Limit Theorem!

Weeeellll…. It turns out Edgeworth expansions almost never converge… So the above is more of the motivational bedtime-story we tell about Edgeworth expansion. But as it turns out, under extremely mild conditions, Edgeworth expansions are “asymptotic expansions”, which is to say that for every k:

uniformly in x so long as E(|X|^(k+2))<∞ and X satisfies a condition called “Cramer’s Condition”, which is a weaker condition than having a continuous pdf.

Proving this requires a bit of heavy lifting, and I refer to Peter Hall’s book “The Bootstrap and Edgeworth Expansion”², Section 2.4 and Chapter 5, for details.

From Edgeworth expansions to an approximation of the median

In Hall’s paper the equivalent of this entire section is the phrase “it follows that”, with no explanation given. As you will see, the argument requires some nuance. But the details do work out…!

Let:

Since

it follows that:

Let’s plug that into the Edgeworth expansion above:

Note that since pdf of the standard normal is exponential, the numerator of

is bounded, and therefore this fraction converges to 0. This implies that

and therefore:

(We ultimately want to show that m_n converges, but this is a first step that is required for the remainder of the proof.)

Let’s now use the Maclaurin expansion:

Its radius of convergence is infinity. It follows that:

It can be shown that the p_1(x) term in the Edgeworth expansion is equal -κ_3(x²-1)/6, hence:

In particular:

Therefore we have that:

Which, after multiplying by the square root of n implies that:

We’re ready for the kill now. Using the Maclaurin expansion:

We can see that:

Which in particular is:

Therefore as n approaches infinity:

and we’re done!

Final note on a Hall-like result for non-i.i.d

Somewhat understandably, there are no asymptotic results about the median of a sum of independent but not identically distributed random variables. However, surprisingly, there does exist a variant of Edgeworth expansions that is available for non-i.i.d random variables that is ideal under some rather complex definition of being ideal. (See “Edgeworth Expansions of Distribution Functions of Independent Random Variables” by Bai Zhidong and Zhao Lincheng³.) It is conceivable that one can follow the recipe above to go from the Edgeworth expansion to an estimate of the median.

If this post got your interest and you would like to challenge yourself with similar problems, we are hiring! Please check our open positions.

GitHub: https://github.com/lumiata/tech_blog

Visit Lumiata at www.lumiata.com and follow on Twitter via @lumiata.

Find us on LinkedIn: www.linkedin.com/company/lumiata

Citations:

1.Hall, P. (1980) On the limiting behaviour of the mode and median of a sum of independent random variables. Ann. Probability 8 419–430.

2. Hall, P. (1992) The Bootstrap and Edgeworth Expansion. Springer-Verlag,
New York.

3. Bai, Z. and Zhao, L. (1984) Edgeworth Expansions of Distribution Functions of Independent Random Variables. Scientia Sinica 29 1–22.