Summary: It’s become almost part of our culture to believe that more data, particularly Big Data quantities of it, will result in better models and therefore better business value. The problem is that it’s just not always true. Here are 7 cases that make the point.
Following the literature and the technology, you would think there is universal agreement that more data means better models. With the explosion of in-memory analytics, Big Data quantities of data can now realistically be processed to produce a variety of predictive models, and a lot of folks are shouting BIGGER IS BETTER.
However, every time a meme like this starts to pick up steam, it’s a good idea to step back and examine the premise. Is it universally true that our models will be more accurate if we use more data? As a data scientist, you will want to question this assumption, and not automatically reach for that brand-new high-performance in-memory modeling array before examining some of these issues.
Let’s start the conversation this way: in general I would always prefer to have more data than less, but there are some legitimate considerations and limitations to this maxim, and I’d like to examine several of them. Spoiler alert: there’s not one answer here. You have to use your best professional judgment and really think about it.
1. How Much Accuracy Does More Data Add?
Time and compute speeds are legitimate constraints on how much data we use to model. If you are fortunate enough to have both tons of data and one of the new in-memory analytic platforms then you don’t suffer this constraint. However, the great majority of us don’t yet have access to these platforms. Does this mean that our traditional methods of sampling data are doomed to create suboptimal models no matter how sophisticated our techniques?
The answer is clearly no, with some important qualifiers that we’ll discuss further on. If we go back to the 1990s, when computational speed was much more severely restricted, you could find academic studies that supported between 15 and 30 observations for each feature (e.g., Pedhazur, 1997, p. 207). Today we might regard this as laughably small, but those studies showed that models built to this criterion could generalize well.
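To make that rule of thumb concrete, here’s a minimal sketch. The `rows_needed` helper is purely illustrative (it is not from the cited studies); it just expresses the 15-to-30-observations-per-feature guideline as arithmetic.

```python
# Hypothetical helper: the 1990s rule of thumb suggests roughly
# 15-30 observations per feature for a model that generalizes.
def rows_needed(n_features, obs_per_feature=30):
    """Rows required under the observations-per-feature rule of thumb."""
    return n_features * obs_per_feature

print(rows_needed(50))      # 50-feature model, high end of the rule -> 1500
print(rows_needed(50, 15))  # low end of the rule -> 750
```

By today’s standards these counts are tiny, which is exactly the author’s point: the bar for a model that generalizes was historically far below Big Data scale.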
I’ve always held that better models mean better business value. Quite small increases in fitness scores can leverage up into much larger percentage increases in campaign ROI. But there are studies suggesting that for (relatively normal) consumer behavior scoring models, anything over 15,000 rows doesn’t return much additional accuracy.
If 15,000 observations seems trivial to you, then make it 50,000 and do your own experiments. The reductio-ad-absurdum version of this would be to choose 10 million observations if you already had 1 million to work with. At some point any additional accuracy disappears into distant decimal points. Sampling large data sets, in many if not most cases, still works fine.
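If you want to run that experiment yourself, a learning-curve sweep is the standard way to do it. The sketch below assumes scikit-learn and uses a synthetic dataset; with your own data you would substitute your real feature table and target.

```python
# Sketch: measure how hold-out accuracy changes as the training sample
# grows. Synthetic data stands in for a real scoring dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

results = {}
for n in (1_000, 5_000, 15_000, 50_000, 150_000):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[:n], y_train[:n])           # train on the first n rows
    results[n] = accuracy_score(y_test, model.predict(X_test))
    print(f"{n:>7} rows -> test accuracy {results[n]:.4f}")
```

On well-behaved data you will typically see the accuracy curve flatten long before the largest sample size: the gains past some threshold show up only in distant decimal points, which is the plateau the author describes.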
2. My Data Has Lots of Missing Values
If you’ve got a dataset with lots of missing data, should you favor a much larger dataset? Not necessarily. The first question to answer is whether the features with high missing-data rates are predictive. If they are, there are still many techniques for estimating the missing values that are perfectly effective and don’t warrant using multi-gigabyte datasets for modeling. Once again, the question is how many decimal points out any enhancement in result will be, and whether that warrants the additional compute time or investment in in-memory analytic capability.
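As one example of those estimation techniques, here’s a minimal imputation sketch, assuming scikit-learn’s `SimpleImputer` and a toy array; real pipelines would apply the same idea per column before modeling.

```python
# Sketch: fill missing values with a simple per-column statistic
# instead of reaching for a vastly larger dataset.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan],
              [4.0, 5.0]])

# "mean" replaces each NaN with its column mean;
# "median" or "most_frequent" are alternatives.
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)
print(X_filled)
```

More sophisticated options (model-based or iterative imputation) exist, but even this simple strategy often recovers most of a predictive feature’s value without any increase in dataset size.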
3. The Events I’m Modeling are Rare (Like Fraud)
Some events, like fraud, are so costly to an organization that a very large expenditure of energy to detect and prevent them can be justified. Also, because these events are rare (sometimes fractions of 1% of your dataset), the concern is that sampling might miss them.
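That concern is real for naive random sampling, but stratified sampling addresses it directly by preserving the rare class’s proportion in the sample. A minimal sketch, assuming scikit-learn and synthetic labels with a roughly 0.5% "fraud" rate:

```python
# Sketch: draw a 15,000-row sample while keeping the rare class's
# proportion intact, so fraud cases aren't lost to chance.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
y = (rng.random(100_000) < 0.005).astype(int)   # ~0.5% rare "fraud" labels
X = rng.normal(size=(100_000, 5))

# stratify=y forces the sample's class mix to match the full dataset's.
X_sample, _, y_sample, _ = train_test_split(
    X, y, train_size=15_000, stratify=y, random_state=0)
print(f"full rate: {y.mean():.4f}, sample rate: {y_sample.mean():.4f}")
```

In practice rare-event modelers often go further and deliberately oversample the rare class, but stratification alone is enough to guarantee the sample doesn’t miss it entirely.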