Healthcare plans have historically been priced in bulk: pooling money across many members alleviates the risk borne by any one individual. When predicting cost at this level, it is typical to use aggregated features such as historical account cost and account demographics.
Growing capabilities for handling big, unstructured data, together with advances in modeling, make individual-level prediction increasingly feasible and an attractive alternative to modeling at the aggregate level. Individual-level models can use medical data including lab values, diagnoses, prescriptions, or even doctors’ notes to predict the cost of a single patient. However, there are risks in naively rolling individual-level predictions up to account-level predictions without due consideration of how model error can be amplified.

Advantages of Individual Predictions

An account-level view is an extreme simplification of a very complex cost relationship between patient health, hospitals, and providers. A more detailed look at patient medical claims, for example, could provide more accurate predictions of cost. Claims data consist of thousands of possible medical codes, each of which represents an event in a patient’s medical history. Patient-level models can discover complex interactions between these codes that are predictive of cost. For this purpose, Lumiata has built a robust patient-tagging process capable of translating claims data into ingestible patient timelines. We leverage this in three ways:
Concerns with Individual Predictions

But with great power comes great responsibility. Approaching cost prediction at the patient level introduces complexity that must be tightly controlled. Several characteristics of healthcare cost make this especially relevant:
Bias-Variance Review

For a more in-depth review of the bias-variance tradeoff, I would suggest David Dalpiaz’s great online resource, R for Statistical Learning; the examples here are adapted from it. Assume some random vector (X, Y) with values in ℝᵖ×ℝ and define f(x) := E(Y | X = x). Notice that this form of f(x) minimizes the expected squared error, so it represents the best possible prediction we can make. Since f(x) is unknown, we approximate it with f̂(x) using some training data D and our favorite machine learning algorithm. (When I refer to an “algorithm,” I mean the method used to learn a specific model.) Using these definitions and conditioning on X, the expected squared prediction error decomposes into two components, reducible error and irreducible error:

    E[(Y − f̂(x))² | X = x] = E[(f(x) − f̂(x))²] + V(Y | X = x)
Reducible error is what we strive to *drumroll* reduce, as it measures how well f̂(x) approximates f(x). Irreducible error, which equals V(Y | X = x), is simply not a learnable function of X and should be recognized as noise. The reducible error can be decomposed further into bias and variance.
Bias is a measure of the deviation between the expected form of our model and f(x). The word “expected” appears because the model is a function of the underlying training data, which is itself a random variable. Variance, on the other hand, measures the expected deviation of f̂(x) from its expected fit E[f̂(x)]:

    bias(f̂(x)) = E[f̂(x)] − f(x)
    V(f̂(x)) = E[(f̂(x) − E[f̂(x)])²]
    E[(f(x) − f̂(x))²] = bias²(f̂(x)) + V(f̂(x))
It is always possible to have a completely unbiased model with high variance by perfectly fitting the training data, but such a model changes significantly with its input and therefore generalizes poorly (i.e., it overfits). To lower the variance, the model must make certain generalizing assumptions. The more such assumptions it makes, the lower the variance, but at the cost of bias if those assumptions turn out to be incorrect. For example, if we fit a linear regression when the true expected values are not linear in the features, that bad assumption introduces bias; it does, however, decrease variance.

Visualizing Bias and Variance

To demonstrate the bias-variance tradeoff, I repeatedly fit polynomial models to simulated data (normally distributed random points with mean x²) as defined in the following code snippet.
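A minimal sketch of such a simulation in Python (the helper names `simulate_data` and `fit_poly` are illustrative, not Lumiata’s actual code):

```python
import numpy as np

rng = np.random.default_rng(42)

def f(x):
    """True regression function: E[Y | X = x] = x^2."""
    return x ** 2

def simulate_data(n=100, sigma=0.3):
    """Draw n points with X ~ Uniform(0, 1) and Y = f(X) + Gaussian noise."""
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(0, sigma, n)
    return x, y

def fit_poly(x, y, k):
    """Fit a degree-k polynomial; returns a callable prediction function."""
    coeffs = np.polyfit(x, y, deg=k)
    return np.poly1d(coeffs)
```

Each call to `simulate_data` plays the role of a fresh draw of the training data D, so refitting on repeated calls lets us observe the randomness of f̂ itself.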
Below I used three algorithms, representing a biased algorithm (k = 1), an unbiased, low-variance algorithm (k = 2), and an unbiased, high-variance algorithm (k = 10):
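A sketch of the experiment behind the plot: refit each degree-k polynomial on fresh simulated data (normal noise around x², as described above) many times, and record each fit’s prediction at a fixed test point. The specific point, sample sizes, and noise level here are my own choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def predictions_at(x0, k, n_sims=250, n=100, sigma=0.3):
    """Prediction of a degree-k polynomial fit at x0, over n_sims fresh training sets."""
    preds = np.empty(n_sims)
    for i in range(n_sims):
        x = rng.uniform(0, 1, n)
        y = x ** 2 + rng.normal(0, sigma, n)
        preds[i] = np.poly1d(np.polyfit(x, y, deg=k))(x0)
    return preds

x0 = 0.9  # fixed test point at which we inspect the fits
for k in (1, 2, 10):
    preds = predictions_at(x0, k)
    bias = preds.mean() - x0 ** 2   # deviation of the average fit from f(x0)
    var = preds.var()               # spread of the fits around their average
    print(f"k={k:2d}  bias={bias:+.3f}  variance={var:.4f}")
```

The linear fit (k = 1) shows a clearly nonzero bias but a tight spread, while the degree-10 fit is roughly centered on the truth with a much larger spread.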
In the above plot, f̂₁ seems quite consistent across varying training data, even though it misses the true form of f (defined above as x²). In contrast, observe that f̂₁₀, while following the trend of the data, varies a great deal from simulation to simulation. There are several things to notice here: for the biased algorithm, f̂₁, the distribution of fit errors is not centered around 0; however, it is relatively tight. As k increases, bias is reduced but the dispersion of the fits increases. Additionally, notice that even for the well-conditioned algorithm, there is inherent randomness in the fitting process.

The Risks of Aggregation

What happens to error with aggregated models? A naive approach to prediction at the group level is to train a patient-level model and then, for each group, set the aggregate prediction to the sum of the patient predictions over the group’s members. The question then arises: does this procedure optimize the group-level error? As it turns out, the answer is no! To see this, consider the following heuristic. Let a group of size N consist of individuals with feature values {x₁, …, x_N} and true costs {y₁, …, y_N}, and further assume that costs are independent across patients. For the sake of simplicity, assume that x₁ = … = x_N := x. (If naive aggregation fails even under this assumption, there is no hope in general.) Then:

    yᵢ = f(x) + εᵢ,  i = 1, …, N,

where f(x) := E(Y | X = x), and the εᵢ are independent, identically distributed variables with mean 0.
As before, the patient-level error for X = x has a decomposition:

    E[(yᵢ − f̂(x))²] = bias²(f̂(x)) + V(f̂(x)) + σ²,

where V(εᵢ) = σ² for all i.
Let’s look now at the group-level error:

    E[(y₁ + … + y_N − N·f̂(x))²] = E[(N·f(x) + Σᵢ εᵢ − N·f̂(x))²]
                                 = N²·bias²(f̂(x)) + N²·V(f̂(x)) + N·σ²,

by the linearity of expectation, and the fact that the variance of a sum of independent variables is the sum of the variances of those variables.
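This scaling is easy to check by simulation. The sketch below uses a deliberately simplified setup (a deterministic prediction with a fixed per-member bias b, so the model-variance term drops out; all parameter values are arbitrary) and compares the Monte Carlo group error against N²·b² + N·σ²:

```python
import numpy as np

rng = np.random.default_rng(1)

def group_mse(N, mu=100.0, b=2.0, sigma=10.0, n_sims=100_000):
    """Average squared error of summing N identical, biased member predictions."""
    # True member costs: mean mu plus independent noise, one row per simulation.
    y = mu + rng.normal(0, sigma, size=(n_sims, N))
    group_pred = N * (mu + b)          # naive sum of per-member predictions
    errors = y.sum(axis=1) - group_pred
    return (errors ** 2).mean()

for N in (1, 10, 100):
    theory = (N * 2.0) ** 2 + N * 10.0 ** 2   # N^2 * b^2  +  N * sigma^2
    print(f"N={N:3d}  simulated={group_mse(N):10.1f}  theory={theory:10.1f}")
```

Note that multiplying N by 10 multiplies the bias contribution by 100 but the noise contribution by only 10, which is exactly the asymmetry derived above.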
Thus the error of the aggregation is not a linear multiple of the sum of the errors of the individual predictions! At the aggregated level, the reducible error is multiplied by N² (in fact, both the squared bias and the variance are multiplied by N²), while the irreducible error is multiplied only by N.

Discussion

In the analysis above we can notice, somewhat trivially, that bias compounds. Someone without much experience in data science might think that we may underpredict a little here and overpredict a little there… but hey, it all adds up to a null sum! This brings to mind the old joke, “We lose money on every sale, but make it up on volume!” That is, when you add up many small losses, you get a big loss!

In healthcare we are even more likely to compound model error, because accounts tend to be more homogeneous than the general population; that is, members within an account are more similar to one another than they are to members of the general population. For example, consider a logging company (the most accident-prone job in America). Its members have a higher expected cost than the general population, so a model trained on the general population predicts a cost f̂(x) below the account’s expected cost f(x), and we have introduced bias. Furthermore, since we are predicting in aggregate for every member of that account, we compound that bias with each prediction! This generalizes to any account whose characteristics are not reflected in the general population.

Making Healthcare Smarter

Optimizing patient predictions for aggregate performance is an area of active research at Lumiata. Below are some strategies we employ when predicting at the group level:
Following this roadmap, we can leverage patient-level data while avoiding the pitfalls that accompany naive aggregation. If you are interested in building and scaling cool models with healthcare data, Lumiata is hiring!
Matt McClelland
Data Scientist, Lumiata 
