This post presents some common scenarios where a seemingly good machine learning model may still be wrong, along with a discussion of how to evaluate these issues by assessing bias vs. variance and precision vs. recall.

There are a number of machine learning models to choose from. We can use Linear Regression to predict a value, Logistic Regression to classify distinct outcomes, and Neural Networks to model non-linear behaviors.

When we build these models, we always use a set of historical data to help our machine learning algorithms learn the relationship between a set of input features and a predicted output. But even if the model can accurately predict values in that historical data, how do we know it will work as well on new data?

Or more plainly, how do we evaluate whether a machine learning model is actually “good”?

In this post we’ll walk through some common scenarios where a seemingly good machine learning model may still be wrong. We’ll show how you can evaluate these issues by assessing metrics of bias vs. variance and precision vs. recall, and present some solutions that can help when you encounter such scenarios.

When evaluating a machine learning model, one of the first things you want to assess is whether you have “High Bias” or “High Variance”.

High Bias refers to a scenario where your model is “underfitting” your example dataset (see figure above). This is bad because your model does not capture the relationship between your inputs and predicted output accurately, and it often produces high error (i.e. a large difference between the model’s predicted value and the actual value).

High Variance represents the opposite scenario. In cases of High Variance or “overfitting”, your machine learning model fits your example dataset so closely that it captures the dataset’s quirks and noise as well as the underlying relationship. While near-perfect accuracy on that dataset may seem like a good outcome, it is a cause for concern, as such models often fail to generalize to future datasets. So while your model works well for your existing data, you don’t know how well it’ll perform on other examples.
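To make the two failure modes concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the toy data (y = 2x with an alternating +1/-1 disturbance), the names `seen` and `unseen`, and the three stand-in models. A constant predictor plays the underfitting model, a nearest-point lookup plays the overfitting model, and a least-squares line sits in between.

```python
# Toy data (made up for illustration): y = 2x plus an alternating +1/-1 term.
seen = [(float(i), 2.0 * i + (1.0 if i % 2 == 0 else -1.0)) for i in range(10)]
unseen = [(i + 0.5, 2.0 * (i + 0.5) + (-1.0 if i % 2 == 0 else 1.0)) for i in range(9)]


def mse(model, data):
    """Mean squared error of model(x) against y over (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


def fit_constant(data):
    """Underfits: ignores x and always predicts the mean of y."""
    mean_y = sum(y for _, y in data) / len(data)
    return lambda x: mean_y


def fit_memorizer(data):
    """Overfits: returns the y of the nearest seen x (a 1-nearest-neighbour lookup)."""
    return lambda x: min(data, key=lambda point: abs(point[0] - x))[1]


def fit_line(data):
    """Ordinary least-squares straight line through the data."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in data)
             / sum((x - mean_x) ** 2 for x, _ in data))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


models = {
    "constant (high bias)": fit_constant(seen),
    "memorizer (high variance)": fit_memorizer(seen),
    "straight line": fit_line(seen),
}
# For each model: (error on the data it was fitted to, error on new points).
errors = {name: (mse(m, seen), mse(m, unseen)) for name, m in models.items()}
for name, (seen_err, unseen_err) in errors.items():
    print(f"{name:26s} seen MSE={seen_err:6.2f}  unseen MSE={unseen_err:6.2f}")
```

On this toy data the constant model has a large error everywhere, while the memorizer reproduces the points it was fitted to exactly but does noticeably worse than the straight line on the points it has not seen.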

But how can you know whether your model has High Bias or High Variance?

One straightforward method is to do a Train-Test Split of your data. For instance, train your model on 70% of your data, and then measure its error rate on the remaining 30% of data. If your model has high error in both the train and test datasets, you know your model is underfitting both sets and has High Bias. If your model has low error in the training set but high error in the test set, this is indicative of High Variance as your model has failed to generalize to the second set of data.
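The rule of thumb above can be sketched in a few lines of plain Python. The helper names `train_test_split` and `diagnose`, and the `acceptable_error` threshold, are hypothetical and introduced only for this illustration; what counts as "high" error depends on your problem, so treat the threshold as something you choose rather than a standard value.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle rows and return (train, test) lists; 70/30 by default."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def diagnose(train_error, test_error, acceptable_error):
    """Apply the train/test rule of thumb.

    acceptable_error is an assumed, problem-specific threshold that
    separates "low" from "high" error.
    """
    train_high = train_error > acceptable_error
    test_high = test_error > acceptable_error
    if train_high and test_high:
        return "high bias"       # underfits both sets
    if not train_high and test_high:
        return "high variance"   # fits training data, fails to generalize
    if not train_high and not test_high:
        return "just right"
    return "inconclusive"        # high train error but low test error


train, test = train_test_split(range(10))
print(len(train), len(test))
print(diagnose(train_error=0.40, test_error=0.45, acceptable_error=0.10))
print(diagnose(train_error=0.02, test_error=0.40, acceptable_error=0.10))
```

In practice you would fit the model on `train`, compute its error on `train` and on `test`, and pass those two numbers to `diagnose`.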

If you can generate a model with low error on both your train (past) and test (standing in for future) datasets, you’ll have found a model that is “Just Right”: one that balances bias and variance.

Even when you have high accuracy, it’s possible that your machine learning model may be susceptible to other types of error.

Take the case of classifying email as spam (the positive class) or not spam (the negative class). Suppose 99% of the email you receive is not spam and 1% is spam. If we were to train a machine learning model and it learned to always predict that an email is not spam (negative class), then it would be accurate 99% of the time despite never catching the positive class.
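The arithmetic behind this example can be checked with a short sketch. The 100-email sample and the helper `confusion_counts` are assumptions made for illustration. Recall is the fraction of actual spam that was caught, and precision is the fraction of predicted spam that really was spam. Precision is undefined when nothing is predicted positive, so the sketch reports it as 0.0 by convention.

```python
def confusion_counts(actual, predicted):
    """Return (tp, fp, tn, fn) for binary labels where 1 = spam, 0 = not spam."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a and p:
            tp += 1   # spam correctly flagged
        elif p:
            fp += 1   # legitimate email flagged as spam
        elif a:
            fn += 1   # spam that slipped through
        else:
            tn += 1   # legitimate email left alone
    return tp, fp, tn, fn


# Assumed sample: 1 spam email out of 100, and a model that never predicts spam.
actual = [1] * 1 + [0] * 99
predicted = [0] * 100

tp, fp, tn, fn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / len(actual)
recall = tp / (tp + fn) if (tp + fn) else 0.0
precision = tp / (tp + fp) if (tp + fp) else 0.0  # undefined here; 0.0 by convention

print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f}")
```

The model scores 99% accuracy while its recall on the spam class is zero, which is why accuracy alone can be misleading on imbalanced data.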