Summary: It’s become almost part of our culture to believe that more data, particularly Big Data quantities of it, will result in better models and therefore better business value. The problem is that it’s just not always true. Here are 7 cases that make the point.
Following the literature and the technology, you would think there is universal agreement that more data means better models. With the explosion of in-memory analytics, Big Data quantities of data can now realistically be processed to produce a variety of predictive models, and a lot of folks are shouting BIGGER IS BETTER.
However, every time a meme like this starts to pick up steam, it’s a good idea to step back and examine the premise. Is it universally true that our models will be more accurate if we use more data? As a data scientist, you will want to question this assumption, and not automatically reach for that brand-new high-performance in-memory modeling array before examining some of these issues.
Let’s start the conversation this way: in general I would always prefer to have more data than less, but there are some legitimate considerations and limitations to this maxim, and I’d like to examine several of them. Spoiler alert: there’s not one answer here. You have to use your best professional judgment and really think about it.
1. How Much Accuracy Does More Data Add?
Time and compute speeds are legitimate constraints on how much data we use to model. If you are fortunate enough to have both tons of data and one of the new in-memory analytic platforms then you don’t suffer this constraint. However, the great majority of us don’t yet have access to these platforms. Does this mean that our traditional methods of sampling data are doomed to create suboptimal models no matter how sophisticated our techniques?
The answer is clearly no, with some important qualifiers that we’ll discuss further on. If we go back to the 1990s, when computational speed was much more severely restricted, you could find academic studies that supported between 15 and 30 observations for each feature (e.g., Pedhazur, 1997, p. 207). Today we might regard this as laughably small, but those studies showed that models built to this criterion could generalize well.
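To make that rule of thumb concrete, here’s a minimal sketch. The `rows_needed` helper is purely illustrative (it is not from the cited studies); it just expresses the 15-to-30-observations-per-feature guideline as arithmetic.

```python
# Hypothetical helper: the 1990s rule of thumb suggests roughly
# 15-30 observations per feature for a model that generalizes.
def rows_needed(n_features, obs_per_feature=30):
    """Rows required under the observations-per-feature rule of thumb."""
    return n_features * obs_per_feature

print(rows_needed(50))      # 50-feature model, high end of the rule -> 1500
print(rows_needed(50, 15))  # low end of the rule -> 750
```

By today’s standards these counts are tiny, which is exactly the author’s point: the bar for a model that generalizes was historically far below Big Data scale.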
I’ve always held that better models mean better business value. Quite small increases in fitness scores can leverage up into much larger percentage increases in campaign ROI. But there are studies suggesting that for (relatively normal) consumer behavior scoring models, anything over 15,000 rows doesn’t return much additional accuracy.
If 15,000 observations seems trivial to you, then make it 50,000 and do your own experiments. The reductio-ad-absurdum version of this would be to choose 10 million observations if you already had 1 million to work with. At some point any additional accuracy disappears into distant decimal points. Sampling large data sets, in many if not most cases, still works fine.
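If you want to run that experiment yourself, a learning-curve sweep is the standard way to do it. The sketch below assumes scikit-learn and uses a synthetic dataset; with your own data you would substitute your real feature table and target.

```python
# Sketch: measure how hold-out accuracy changes as the training sample
# grows. Synthetic data stands in for a real scoring dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

results = {}
for n in (1_000, 5_000, 15_000, 50_000, 150_000):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[:n], y_train[:n])           # train on the first n rows
    results[n] = accuracy_score(y_test, model.predict(X_test))
    print(f"{n:>7} rows -> test accuracy {results[n]:.4f}")
```

On well-behaved data you will typically see the accuracy curve flatten long before the largest sample size: the gains past some threshold show up only in distant decimal points, which is the plateau the author describes.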
2. My Data Has Lots of Missing Values
If you’ve got a dataset with lots of missing data, should you favor a much larger dataset? Not necessarily. The first question to answer is whether the features with high missing-data rates are predictive. If they are, there are still many techniques for estimating the missing values that are perfectly effective and don’t warrant using multi-gigabyte datasets for modeling. Once again, the question is how many decimal points out any enhancement in result will be, and whether that warrants the additional compute time or investment in in-memory analytic capability.
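As one example of those estimation techniques, here’s a minimal imputation sketch, assuming scikit-learn’s `SimpleImputer` and a toy array; real pipelines would apply the same idea per column before modeling.

```python
# Sketch: fill missing values with a simple per-column statistic
# instead of reaching for a vastly larger dataset.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan],
              [4.0, 5.0]])

# "mean" replaces each NaN with its column mean;
# "median" or "most_frequent" are alternatives.
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)
print(X_filled)
```

More sophisticated options (model-based or iterative imputation) exist, but even this simple strategy often recovers most of a predictive feature’s value without any increase in dataset size.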
3. The Events I’m Modeling are Rare (Like Fraud)
Some events, like fraud, are so costly to an organization that a very large expenditure of energy to detect and prevent them can be justified. Also, because these events are rare (sometimes fractions of 1% of your dataset), the concern is that sampling might miss them.
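That concern is real for naive random sampling, but stratified sampling addresses it directly by preserving the rare class’s proportion in the sample. A minimal sketch, assuming scikit-learn and synthetic labels with a roughly 0.5% "fraud" rate:

```python
# Sketch: draw a 15,000-row sample while keeping the rare class's
# proportion intact, so fraud cases aren't lost to chance.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
y = (rng.random(100_000) < 0.005).astype(int)   # ~0.5% rare "fraud" labels
X = rng.normal(size=(100_000, 5))

# stratify=y forces the sample's class mix to match the full dataset's.
X_sample, _, y_sample, _ = train_test_split(
    X, y, train_size=15_000, stratify=y, random_state=0)
print(f"full rate: {y.mean():.4f}, sample rate: {y_sample.mean():.4f}")
```

In practice rare-event modelers often go further and deliberately oversample the rare class, but stratification alone is enough to guarantee the sample doesn’t miss it entirely.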