## Introduction

The goal of this post is to present some of my thoughts about a very common, yet scantily addressed, problem in machine learning at large (and healthcare in particular): how do you construct group-level predictions given only individual-level data in a way that optimizes a pre-determined loss function (e.g., MAE or MSE)?

If you first create an individual-level model optimizing for some loss function and then take the average of the predictions, do you automatically optimize for the same loss function at the group level? As it turns out, while the answer happens to be “yes” in the case of MSE, the answer is a resounding “NO” in the case of MAE!

In this post I explain why the knee-jerk method of using the same loss function at the individual level is The Wrong Thing To Do.

This is a story about uncovering the real nature of this problem.

As part of this exploration I use an interesting theorem by Peter Hall whose proof I found to be somewhat difficult to penetrate. As a service to the community, in Part 2 I will present the backstory for the theorem, and provide a slightly more natural proof than the one he provided in his paper, with all of the crucial details that were left out of the original paper filled in.

All code for plots seen in this post can be found in this GitHub repo.

## Example and Assumptions

In order to make this exploration easier to follow, I will use a concrete example and fix my assumptions.

*Example*: A health insurance company wants to estimate the per-person cost of employer groups (i.e., the total cost of the group divided by the number of members) over the span of the next year, where each employer group consists of some number of individual members.*Base Assumptions*: You are given training and test data that consist of individual-level data as well as a member-to-group map. The training data has target values represented by each individual’s incurred cost in the next year, while the test data has no target values. The test data and the training data potentially have different groups.

It only remains to specify what “true values” to compare against at the group level. In order to show why optimizing for the same loss function at the individual level that you do at the group level is misguided, it would be easiest using the following assumption:

*Assumption A*: For each group, set its true value to be the average target value of its members.

We will later also consider another assumption that arises more naturally if you allow for membership to vary.

## Why have an individual model at all?

One way to optimize a particular loss function at the group level is simply to create a group-level model that optimizes this loss function, whose features are engineered aggregated individual-level features. But the number of groups is presumably much smaller than the number of individuals, so one would expect that such a model would suffer from high variance. A combined individual-level and group-level approach would be wisest.

“Why not use Recurrent Neural Networks? That way you can make a group-level model that uses all the data!”, I hear you cry. Well… That would be using all the individual-level features, but not the unaggregated target values, unless a special loss function is employed. And even then, RNN’s are order-dependent, and group membership is not! So let’s go ahead and call that “experimental”.

Either way, your group-level predictions can only gain from doing an individual-level model well. So let’s get to it!

## Intuition about optimizing for MAE/MSE

The bottom line is: optimizing for MSE means you are estimating the mean; optimizing for MAE means you are estimating the median.

What does that actually mean? Let ** Y** be the target value, and let

**be the features. If your feature values are**

*X_1,…,X_n***, then the target value given those features**

*X_1=x_1,…,X_n=x_n***is a random variable, not a constant. In other words, your feature values don’t determine the target. For example, if you are predicting cost, then it is perfectly conceivable that two individuals with the same feature values have different costs; though knowing these feature values does change the distribution of cost.**

*Y|X_1=x_1,…,X_n=x_n*If you are optimizing for MSE, then plugging in ** (x_1,…,x_n)** into your model will attempt to predict

**; whereas if you are optimizing for MAE your model will attempt to predict**

*E(Y|X_1=x_1,…,X_n=x_n)***.**

*median(Y|X_1=x_1,…,X_n=x_n)*Indeed, if ** f(x_1,…,x_n) **is the model’s prediction at these feature values for a machine learning model whose loss function is MSE, then it would attempt to approximate the

**that minimizes:**

*a*This last expression is a parabola in ** a** with global minimum

**.**

*E(Y|X_1=x_1,…,X_n=x_n)*Similarly, if ** f(x_1,…,x_n) **is the model’s prediction at these feature values for a machine learning model whose loss function is MAE, then it would attempt to approximate the

**that minimizes:**

*a*The ** a** that minimizes this expression is

**. (To see this you have to play around with integrals; it’s an easy but annoying exercise.)**

*median(Y|X_1=x_1,…,X_n=x_n)*Technical note: Conditional probability of a random variable given an event only makes sense if the event that you’re conditioning on has positive probability. Attempts to extend the definition to events with zero probability are doomed to be ill-defined unless we specify a limiting procedure; see the Borel–Kolmogorov paradox. If at least one of the ** X_i**’s is continuous, then

**, which would imply that**

*P(X_1=x_1,…,X_n=x_n)=0***doesn’t make any sense. (The definition commonly found in undergraduate textbooks for**

*Y|X_1=x_1,…,X_n=x_n***is not coordinate invariant. For the more advanced readers: conditioning on sub-sigma-algebras rather than events yields the same issue, since the resulting random variable is unique up to “almost sure equality”, which means probability zero events can be exceptions.) We can and will elegantly avoid this issue by restricting the values of the features to values that a computer can represent. In this way, even “continuous” variables are actually discrete, and everything is well-defined.**

*Y|X_1=x_1,…,X_n=x_n*## The problem

Fix a group of size ** m**, and let

be the features of the ** i**-th person. If we are creating an individual-level model that optimizes MSE, then the average of the results is an estimate of

By the linearity of expectation, this is equal to

Great! In other words, by optimizing MSE at the individual level, you’re optimizing MSE at the group level!

Now let’s do the same analysis for MAE. If we are creating an individual-level model that optimizes MAE, then the aggregation of the results is an estimate of

But that is very, very, veeeery far from being

Here’s a quick illustration that the median of the sum is very far from being the sum of the medians, even if the random variables are all i.i.d’s:

In order to set up this example, I will be using a Gamma distribution with scale ** 200** and shape

**. This is what this distribution looks like:**

*1*(Code for plots seen in this post can be found in this GitHub repo.)

Now take ** X_i**’s to be i.i.d’s following this distribution. Then here is a comparison of the average of their medians versus the median of their average:

As you can see, this gap is quite large!

## Optimizing MSE at the individual level is superior to optimizing MAE at the individual level for group-level MAE

If a group is large then it is somewhat reasonable to assume that

is approximately normal; and the median of a normal distribution is its mean. So if you optimize for MSE at the individual level and then take the average of the predictions, you’ll be approximately estimating the median for large groups! This might seem counter-intuitive at first: optimizing MSE at the individual level is superior to optimizing MAE at the individual level not only for group-level MSE, but also for group-level MAE.

But how bad should we expect the bias to be for small groups? (As it will turn out, secretly the key is to use estimates of the error of the Central Limit Theorem.)

## A result by Peter Hall

This piqued my interest. So I started looking for results about the median of a sum of i.i.d’s. I found “On the Limiting Behaviour of the Mode and Median of a Sum of Independent Variables” by Peter Hall¹, where he proved the following result.

(See Part 2 of this post for a deeper understanding of why this theorem is true.) Since the ** X_i**’s are all identically distributed, let’s simply let

**for notational ease. If we do not assume that the**

*X := X_1***’s have mean**

*X_i***and variance**

*0***, a quick back of the envelope computation, reducing to the normalized case, shows that this reduces to the following approximation:**

*1*Notice that this is an asymptotic result! For us to be able to use it in the machine learning context discussed above, we must first make sure that this approximation is reasonable for small ** n**. So let’s do a quick proof-of-concept:

(Code for plots seen in this post can be found in this GitHub repo.)

That’s pretty accurate!

## Bias estimation in the machine learning setting

For each group we would be attempting to estimate

for that group. The summands are not i.i.d’s, and so Hall’s result is not directly applicable. To that end, we can try to replace Assumption A with:

*Assumption B:*For each group, assume that its “true value” is the expected value of the mean of**m**samples, with repetition, of the true values of its members*.*

(Arguably, Assumption B comes up more often in real life. Often in group level predictions the members don’t stay fixed, but the current members’ target values are indicative of the distribution target values of future members.)

Now that the target value is the mean of i.i.d’s (because the sampling is done with repetition), we may employ Hall’s approximation. Fix a group, and let ** T_1, …, T_m** be the sampling, with repetition, mentioned in Assumption B. The

**’s are now i.i.d’s, and so for the sake of clarity let**

*T_i***. We are now faced with the challenge of approximating**

*T:=T_1***and**

*V(T)***for each group. This can be difficult, given that the groups that we want the most to bias-correct are small groups.**

*E((T-μ)³)*One simplistic approach is to estimate

**as**

*V(T)***, and estimate**

*V(Y)***as**

*E((T-μ)³)***. In turn, estimate**

*E((Y-E(Y))³)***and**

*V(Y)***by taking the average values over the target values in our data, and denote these estimates as**

*E((Y-E(Y))³)***and**

*Vˆ***respectively. Whether these estimates work well or not depends on how different the groups are from one another — the more similar they are, the better this bias correction will be. Either way, these approximations are good enough to get a relatively clear picture of how small a group needs to be so that the bias is beyond your level of comfort. Namely, a rough estimate is that if the size of a group is**

*κˆ_3***, then averaging an individual model trained using MSE should have a bias of about**

*m*For groups for which this number is intolerably big, I would suggest extra care: ** Vˆ **and

**may not be good enough estimates, and may lead to a bad bias correction. I would simply suggest to flag groups of this size as requiring a bias correction, and do more research on the bias correction needed that is informed by the data and the particular application you have in mind.**

*κˆ_3*(Notice that while it is tempting to use Bayesian approaches that will allow you to sample from the distributions of the

**’s, the vanilla version of these approaches assumes that the**

*Y|X_1=x_{i, 1},…,X_n=x_{i, n}***’s are normally distributed. If that were the case, there wouldn’t be any bias to correct for… This holds also for using dropout at prediction time in a neural network, which, by the work of Yarin Gal and Zoubin Ghahramani², is roughly equivalent to sampling the posterior in an appropriately defined Gaussian process setting — but that setting also assumes normally distributed error.)**

*Y|X_1=x_{i, 1},…,X_n=x_{i, n}*If this post got your interest and you would like to challenge yourself with similar problems, we are hiring! Please check our open positions.

GitHub: https://github.com/lumiata/tech_blog

Visit Lumiata at www.lumiata.com and follow on Twitter via @lumiata.

Find us on LinkedIn: www.linkedin.com/company/lumiata

*Citations**:*

*1. Hall, P. (1980) On the limiting behaviour of the mode and median of a sum of independent random variables. Ann. Probability 8 419–430.*

*2. Gal, Y. and Ghahramani, Z. (2016) Dropout as a Bayesian approximation: Representing model uncertainty in deep learning.ICML.*