October 5, 2014
Real data often has incorrect values in it. Origins of incorrect data include programmer errors ("oops, we're double counting!"), surprise API changes (a function that used to return proportions suddenly returns percents), or poorly scraped data. When working with data, a desirable property of whatever you're doing is that it be robust: that it continue to work in the presence of some incorrect values.
The Robust Prediction Problem
Now, the Statistics community has worked hard over the years to develop robust estimators of unknown parameters. For many problems, Statisticians have developed this so well as to be able to calculate the asymptotic efficiency loss incurred by using a robust estimator instead of the theoretically optimal one. For example, Statisticians know how many more observations one needs to estimate the expected value of a symmetric distribution by the sample median (which is robust to erroneous data) instead of the sample mean (which is not).
(For the normal distribution, for example, the median requires 1.57 (= pi/2) times as many observations as the mean to achieve the same asymptotic variance; in general the ratio depends on the distribution's variance and its density at the point of symmetry, and the median can come out ahead.)
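The pi/2 figure is easy to check empirically. Here's a quick Monte Carlo sketch (the sample size, trial count, and seed are arbitrary choices of mine):

```python
import random
import statistics

# For normal data, the sample median should have roughly pi/2 ~ 1.57 times
# the variance of the sample mean at the same sample size.
random.seed(0)
n, trials = 101, 2000
means, medians = [], []
for _ in range(trials):
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    means.append(statistics.fmean(sample))
    medians.append(statistics.median(sample))

ratio = statistics.pvariance(medians) / statistics.pvariance(means)
print(f"variance ratio (median/mean): {ratio:.2f}")  # should land near 1.57
```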
In supervised machine learning problems, though, we are interested in prediction rather than parameter estimation. One might think that robust parameter estimation translates into robust prediction; just robustly estimate the parameters in a machine learning model and then use them to do prediction, right?
This idea fails in general, however: consider the case where we robustly estimate the parameters of a logistic or linear regression. That's all well and good, but when it comes time to do prediction, we take an inner product of those parameters with the feature vector. This inner product can blow up arbitrarily if even one of the features with a nonzero coefficient takes an arbitrary value. In short, robust parameter estimation techniques are effective only against corruption of training data, not corruption of test data.
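A minimal sketch of the blow-up, with made-up coefficients; however carefully those coefficients were estimated, one corrupted feature moves the prediction arbitrarily far:

```python
# Illustrative coefficients; imagine they came from a robust fit.
coefficients = [0.5, -1.2, 2.0]

def linear_predict(x):
    # Prediction is just the inner product of coefficients and features.
    return sum(c * xi for c, xi in zip(coefficients, x))

clean = [1.0, 2.0, 3.0]
corrupted = [1.0, 2.0, 3e6]  # one feature arbitrarily wrong

print(linear_predict(clean))      # about 4.1
print(linear_predict(corrupted))  # about 6,000,000: the prediction blows up
```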
By their nature, tree-based models are a lot more robust; if you use a decision tree or random forest to do regression, for example, you can never predict values outside the range of the target variable in the training data. This is some comfort; at least trees will never estimate that a person is a thousand years old (if the training data is clean, anyway). (By inspecting a fitted tree, you can also directly determine how much predictions can vary if you arbitrarily corrupt one or more features; this bound can be a lot better than the range of the response variable if the corrupted feature is relatively unimportant for prediction.)
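To see why tree predictions are bounded, here's a hand-rolled depth-one regression "stump" on invented data; a real tree is just a nest of these, and every leaf emits a mean of training targets:

```python
# Toy (feature, age) training pairs; the split point is chosen by hand.
train = [(1.0, 18.0), (2.0, 22.0), (3.0, 35.0),
         (4.0, 41.0), (5.0, 60.0), (6.0, 72.0)]
threshold = 3.5

left = [y for x, y in train if x <= threshold]
right = [y for x, y in train if x > threshold]
leaf_left, leaf_right = sum(left) / len(left), sum(right) / len(right)

def stump_predict(x):
    # Whatever x is, the output is one of two means of training targets.
    return leaf_left if x <= threshold else leaf_right

print(stump_predict(2.5))   # 25.0, a mean of training ages
print(stump_predict(1e12))  # corrupted feature: still ~57.7, never 1000
```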
On the other hand, it's still a failure if your real-time model estimates that a teenager is ninety, that your customer's critical email is spam, or that a fraudulent order is legitimate, even if all of these predictions are within the range of the outcome variable.
So how can we make machine learning predictions robust to erroneous test data?
Naive Thresholding Ideas
One simple thing you could try would be to threshold all of the features. For example, for each (numerical) feature, figure out the 2nd and 98th percentiles in your training data, and then threshold every observation of that feature to be between those bounds, both at training time and at prediction time. This is some help at preventing the "1000-year-old" problem, though if several features are contaminated you could easily predict someone to be a few hundred years old in a linear model. Thresholding your features also offers no robustness improvement to tree-based methods; effectively they are already doing thresholding by splitting on features.
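The thresholding idea fits in a few lines; here's a sketch with a toy nearest-rank percentile (any standard percentile definition would do):

```python
def percentile(sorted_vals, q):
    """Nearest-rank percentile on a pre-sorted list (0 <= q <= 100)."""
    idx = min(len(sorted_vals) - 1, int(q / 100.0 * len(sorted_vals)))
    return sorted_vals[idx]

def fit_bounds(column, lo_q=2, hi_q=98):
    # Learn the clipping bounds from training data only.
    s = sorted(column)
    return percentile(s, lo_q), percentile(s, hi_q)

def clip(value, bounds):
    # Apply at both training and prediction time.
    lo, hi = bounds
    return max(lo, min(hi, value))

train_column = list(range(100))   # toy feature taking values 0..99
bounds = fit_bounds(train_column)
print(clip(1e9, bounds))          # corrupted value clipped to 98
```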
In a regression problem you could also try directly thresholding the response variable. As I argued previously, though, preventing the "1000-year-old" problem is really too weak a form of robustness since it still allows erroneous data to change the prediction throughout the range of the response variable.
Robust Features Ideas
A more promising approach to predictive robustness might be to use the median of several related features as a derived feature rather than use all the original features together. For example, Twitter could represent the popularity of a user by number of followers, average retweets per tweet, or how many times they are @mentioned per day. Each of these says something similar and yet you could imagine them all coming from different sources: the follow graph, event tables, and search indices. If you scaled them all appropriately, and then took the median of the scaled features, you'd get a robust version of a user's popularity. Then, any one of the original sources could get arbitrarily messed up and the derived popularity feature would be only minimally affected if at all.
(You need to scale the features so that taking the median "makes sense": if one of the features usually varies from 1 to 5, another from 1K to 5K, and the last from 1M to 5M, then the median of the three will always be the 1K to 5K feature, offering no robustness guarantees if it becomes corrupted. In this case, "scaling appropriately" might mean dividing the second and third features by a thousand and a million, respectively.)
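Putting the scaling and the median together, a toy version of the derived popularity feature might look like this (the scale factors and all the numbers are invented for illustration):

```python
def robust_popularity(followers, retweets_per_tweet, mentions_per_day):
    # Scale each signal onto roughly the same 1-5 range, then take the
    # median; any single corrupted source barely moves the result.
    scaled = sorted([
        followers / 1_000_000,       # say, typically 1M-5M  -> 1-5
        retweets_per_tweet / 1_000,  # say, typically 1K-5K  -> 1-5
        mentions_per_day,            # say, already 1-5
    ])
    return scaled[1]  # median of three

print(robust_popularity(3_000_000, 2_500, 3.2))  # 3.0, all sources clean
print(robust_popularity(3_000_000, 2_500, 1e9))  # one source corrupted:
                                                 # still 3.0
```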
If you build a model based upon "median features" like this, it will have robustness properties at prediction time. When you train the model, though, you'll still need to fit it in some robust way, so that incorrect values in the response variable do not mess up the estimated model. Alternatively, you can try to toss out all response-variable outliers yourself. This is often feasible, since at training time you can spend as much time as you'd like cleaning your data, a luxury you don't have at prediction time.
The robustness achieved through these "median features" is not free, however; it comes at the cost of worse predictive accuracy. Follower counts, retweets, and @mentions are all indicative of a user's popularity, but all capture different aspects of it. With sufficient (clean) data, a model using all three measures would perform better than one using their derived robust aggregate.
Of course, in general it's a huge pain to manually define similar features, particularly if you have a lot of them. But you could imagine an automated search for clusters of highly correlated features, applying the same "scale then median" trick to each cluster. Unfortunately, this technique does not guarantee that the clusters will have features from different sources; in the Twitter example, you could end up with a cluster containing only features from the events tables. Depending on what kind of data errors you're concerned about, this might not be as robust as you'd like.
To guarantee robustness, you could instead try randomly picking the subsets of features you apply the "scale then median" trick to. This has the advantage of theoretical robustness guarantees: If you scale then median P independent random subsets of size K from your p features, and c arbitrary features are corrupted, then the probability that none of the P robust, derived features breaks down can be relatively easily calculated in closed form. The question then is whether this approach has decent predictive accuracy.
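Under these assumptions the survival probability is a hypergeometric tail raised to the P-th power. A sketch, treating a median of K inputs as broken once more than (K-1)/2 of them are corrupted, and taking K odd for simplicity:

```python
from math import comb

def prob_all_survive(p, K, P, c):
    """P(none of P independent random size-K medians breaks down)
    when c of the p features are corrupted (K odd)."""
    max_bad = (K - 1) // 2  # a median of K tolerates up to this many
    # Hypergeometric tail: P(one subset contains at most max_bad corrupted).
    single_ok = sum(
        comb(c, j) * comb(p - c, K - j)
        for j in range(0, min(c, max_bad) + 1)
    ) / comb(p, K)
    return single_ok ** P  # subsets are drawn independently

print(prob_all_survive(p=100, K=5, P=20, c=3))  # close to 1
```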
A fundamentally different approach would be to detect and correct erroneous data. One approach could look like this: For every feature, build a few different "imputation models" to predict that feature from different subsets of other features. These imputation models should be simple and non-robust, like generalized linear models, for example. Keep track of each of these imputation models' performance.
Then, at prediction time, compare each feature's actual value to its predicted value in each of the imputation models predicting it. If the difference between the actual value and the predicted value is large in some imputation model (compared to that imputation model's typical performance), it provides evidence that either the imputed feature itself or at least one of the features feeding the imputation model is erroneous. This evidence lets you guess which features are erroneous at prediction time and correct them with one or more of the imputation models.
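A toy version of the detect-and-correct loop, with a single least-squares imputation model per feature and an invented tolerance rule (all the data here is made up):

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

# Clean training data: feature 1 is roughly 2x feature 0.
f0 = [1.0, 2.0, 3.0, 4.0, 5.0]
f1 = [2.1, 3.9, 6.2, 8.0, 9.8]
a, b = fit_line(f1, f0)  # impute feature 0 from feature 1
residuals = [abs(x0 - (a * x1 + b)) for x0, x1 in zip(f0, f1)]
tolerance = 5 * max(residuals)  # the imputer's "typical performance"

def check_and_correct(x0, x1):
    imputed = a * x1 + b
    if abs(x0 - imputed) > tolerance:  # looks erroneous
        return imputed                 # fall back to the imputed value
    return x0

print(check_and_correct(3.0, 6.0))     # consistent: keep the observed 3.0
print(check_and_correct(3000.0, 6.0))  # wildly off: replaced by ~3.0
```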
This solution has a few nice features but a lot of poor ones. On the plus side, it seems like the predictive performance of this method would be pretty good in the absence of erroneous data; most of the imputed features will be close to the true values and you'll use the true values to do prediction, incurring no loss. It also helps somewhat with the missing-values problem (although this is a huge related issue requiring its own post). And it deals with categorical features elegantly in a way that the "scale then median" ideas do not; one simply makes the imputation model a classifier in this case.
On the negative side, though, this imputation idea adds a great deal of complication to a machine learning pipeline. It would be quite slow if you have a lot of features: running each of the imputation models will multiply the time to do prediction by a considerable factor, not to mention greatly increasing the training time. And it also requires that you robustly estimate the imputation models (or else make sure your training data is clean).
Please Solve This Problem
I hope somebody solves this problem well. If (like me) your work involves critical, real-time predictive models, then being able to trust that they won't blow up from contaminated data is part of what helps you sleep well at night.