7 Cases Where Big Data Isn’t Better

Summary: It’s become almost part of our culture to believe that more data, particularly Big Data quantities of data, will result in better models and therefore better business value. The problem is, it’s just not always true. Here are 7 cases that make the point.

Following the literature and the technology, you would think there is universal agreement that more data means better models. With the explosion of in-memory analytics, Big Data quantities of data can now realistically be processed to produce a variety of predictive models, and a lot of folks are shouting BIGGER IS BETTER.

However, every time a meme like this starts to pick up steam, it’s a good idea to step back and examine the premise. Is it universally true that our models will be more accurate if we use more data? As a data scientist, you will want to question this assumption and not automatically reach for that brand-new high-performance in-memory modeling array before examining some of these issues.

Let’s start the conversation this way. In general I would always prefer to have more data than less, but there are some legitimate considerations and limitations to this maxim. I’d like to examine several of them. Spoiler alert: there’s not one answer here. You have to use your best professional judgment and really think about it.

 1.       How Much Accuracy Does More Data Add?

Time and compute speeds are legitimate constraints on how much data we use to model. If you are fortunate enough to have both tons of data and one of the new in-memory analytic platforms, then you don’t suffer this constraint. However, the great majority of us don’t yet have access to these platforms. Does this mean that our traditional methods of sampling data are doomed to create suboptimal models, no matter how sophisticated our techniques?

The answer is clearly no, with some important qualifiers that we’ll discuss further on. If we go back to the 1990s, when computational speed was much more severely restricted, you could find academic studies that supported between 15 and 30 observations for each feature (e.g., Pedhazur, 1997, p. 207); for a model with 20 features, that works out to just 300 to 600 rows. Today we might regard this as laughably small, but those studies supported that models built to this criterion could generalize well.

I’ve always held that better models mean better business value. Quite small increases in fitness scores can leverage up into much larger percentage increases in campaign ROI. But there are studies suggesting that in (relatively normal) consumer behavior scoring models, anything over 15,000 rows doesn’t return much additional accuracy.

If 15,000 observations seems trivial to you, then make it 50,000 and do your own experiments, as in the sketch below. The reductio ad absurdum version of this would be to choose 10 million observations when you already had 1 million to work with. At some point any additional accuracy disappears into distant decimal places. Sampling large datasets, in many if not most cases, still works fine.
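One way to run that experiment is to plot a learning curve: score the same model on progressively larger training samples and watch where the validation score flattens. Here is a minimal sketch using scikit-learn; the synthetic dataset, the logistic regression model, and the specific sample sizes are stand-ins you would swap for your own.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Stand-in data: replace with your own feature matrix X and target y.
X, y = make_classification(n_samples=100_000, n_features=20, random_state=42)

# Score the same model on progressively larger training samples.
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X, y,
    train_sizes=[1_000, 5_000, 15_000, 50_000],  # rows drawn from each CV training fold
    cv=5,
    scoring="roc_auc",
)

for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:>7,} rows -> mean validation AUC {score:.4f}")
```

If the curve is essentially flat past 15,000 or 50,000 rows, the extra data is buying you decimal points, not business value.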

 2.       My Dataset Has Lots of Missing Data

If you’ve got a dataset with lots of missing data, should you favor a much larger dataset? Not necessarily. The first question to answer is whether the features with the high missing-data rates are predictive. If they are, there are still many techniques for estimating the missing values that are perfectly effective and don’t warrant using multi-gigabyte datasets for modeling. Once again, the question is how many decimal places out any enhancement in the result will be, and whether that warrants the additional compute time or investment in in-memory analytic capability.
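To make that concrete, here is a minimal imputation sketch with scikit-learn; the column names and the median strategy are illustrative assumptions, not anything prescribed by the article.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame where a predictive feature is frequently missing (hypothetical columns).
df = pd.DataFrame({
    "income": [52_000, np.nan, 61_000, np.nan, 48_000, 75_000],
    "tenure_months": [12, 30, np.nan, 8, 22, 40],
})

# Median imputation is a simple, robust default; scikit-learn's KNNImputer or
# IterativeImputer can do better when features are correlated.
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(filled)
```

Estimating a handful of missing values this way is usually far cheaper than collecting or processing an order of magnitude more rows.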

 3.       The Events I’m Modeling Are Rare (Like Fraud)

Some events, like fraud, are so costly to an organization that a very large expenditure of energy to detect and prevent them can be justified. Also, because these events are rare (sometimes fractions of 1% of your dataset), the concern is that sampling might miss them.
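The usual guard against that, sketched below under the assumption of a binary fraud label, is to sample in a way that preserves (or deliberately boosts) the rare class rather than sampling blindly.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Stand-in data with a 0.5% positive (fraud) rate; swap in your own X and y.
X, y = make_classification(
    n_samples=200_000, n_features=20, weights=[0.995, 0.005], random_state=0
)

# Stratified sampling keeps the fraud rate intact even in a modest sample,
# so a 15,000-row sample still carries its expected ~75 fraud cases.
X_sample, _, y_sample, _ = train_test_split(
    X, y, train_size=15_000, stratify=y, random_state=0
)
print(f"fraud rate in sample: {y_sample.mean():.4%}")
```

Many practitioners go further and oversample the minority class; the point is that rarity argues for careful sampling, not automatically for keeping every row.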
