forrmula1

Automated Predictive Analytics – What Could Possibly Go Wrong?

Summary:  Will Automated Predictive Analytics be a boon to professional data scientists, or a dangerous diversion that lets well-meaning, motivated, but amateur users try to implement predictive analytics?  More on the conversation started last week about new One-Click Data-In Model-Out platforms.

I have always been very much of the view that data science is best left to data scientists.  But in my article last week, “Data Scientists Automated and Unemployed by 2025!”, I detailed what my research had turned up about a new trend: a group of analytic platform developers who are driven to deliver One-Click Data-In Model-Out functionality.  In other words, fully Automated Predictive Analytics.

In that article we used the popular definition:  Automated Predictive Analytics are services that allow a data owner to upload data and rapidly build predictive or descriptive models with a

There is an element here of trying to make the data scientist more efficient by automating the tasks that are least creative.  Much of that is in data cleansing, normalizing, removing skewness, transforming data for specific algorithm requirements, and even running multiple algorithms in parallel to determine champion models.

But the reality is that these companies are pitching directly to non-data scientists with tag lines like “Data Science Isn’t Rocket Science”.  And the reason they are headed in this direction is that Gartner predicts that this ‘Citizen Data Scientist’ market will grow 5X more quickly than the true data scientist market.

There is another compelling motive: while most data scientists have settled on the one or two advanced analytic platforms or tools they prefer to use, this new, expanding market does not come with that baggage.  Selling here will not require displacing SAS or SPSS or any of the other heavy hitters, since those tools don’t meet the needs of these new users.

What Could Possibly Go Wrong

With these concerns in mind I set out to catalogue all the things that can go wrong when excited amateurs use sophisticated tools to produce predictive analytics.  I also went back and talked in greater depth to a few of the five One-Click companies I listed in the earlier article to see if they acknowledged or were addressing these concerns. The list of the five companies is not intended to be exhaustive and neither were my repeat interviews.

A fast way to sub-optimize a model is to start out assuming the data is homogeneous.  Data Science 101 says explore the data visually (to avoid the rookie mistakes illustrated by Anscombe’s Quartet) and lead with clustering and segmentation so you don’t mistake whole new markets for simple outliers.
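To make the Anscombe’s Quartet point concrete, here is a minimal sketch in plain Python.  It uses the commonly published values of the four data sets; the helper names (`mean`, `pearson`, `QUARTET`) are my own for illustration.  It only computes summary statistics, which is exactly the trap: the numbers agree even though the four scatter plots look nothing alike.

```python
from math import sqrt

# Commonly published values for Anscombe's Quartet (four small x/y data sets).
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X_SHARED, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def mean(values):
    return sum(values) / len(values)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Mean of y and correlation of x with y for each of the four sets.
summary = {name: (mean(ys), pearson(xs, ys)) for name, (xs, ys) in QUARTET.items()}

for name, (y_mean, r) in summary.items():
    print(f"Set {name}: mean(y) = {y_mean:.2f}, corr(x, y) = {r:.2f}")
```

All four sets report essentially the same mean and correlation, yet one is roughly linear, one is a curve, one is a line with a single outlier, and one is a vertical stack with a single leverage point.  Only a plot reveals that.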

Some of the One-Clicks have recognized this.  DataRPM leads with segmentation and clustering.  PurePredictive, which just launched, says this is on their development roadmap.
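As a rough illustration of why segmentation comes first, the sketch below runs a tiny one-dimensional k-means on made-up customer spend figures.  This is not how DataRPM or any other vendor does it; the data, the function name `kmeans_1d`, and the choice of two clusters are all assumptions for the example.

```python
def kmeans_1d(values, k=2, iterations=20):
    """Very small 1-D k-means (requires k >= 2). Returns (centers, clusters)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centers across the sorted values.
    centers = [ordered[int(i * (len(ordered) - 1) / (k - 1))] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            # Assign each value to its nearest center.
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters


# Hypothetical monthly spend per customer: most are low, a handful are much higher.
spend = [20, 22, 19, 21, 23, 18, 20, 95, 102, 98, 105]
centers, clusters = kmeans_1d(spend, k=2)

for center, members in zip(centers, clusters):
    print(f"segment around {center:.1f}: {len(members)} customers")
```

A naive outlier rule might discard the four high spenders.  Clustering instead shows them as a coherent second segment, which is the "whole new market" the paragraph above warns about missing.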

If we limit data cleansing to ensuring that all the data in a column is consistently numeric (no extraneous symbols such as commas) or categorical (ensuring a limited and consistent use of alphas), we may actually be better off with an automated system that is less likely to make human errors.  Like spell check on your phone, however, you’ll want to inspect the assumptions, especially when automatically trying to standardize alphas.
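Here is a minimal sketch of what such automated column checks might look like, and why the result deserves a look.  The function names `clean_numeric` and `standardize_labels` and the sample values are hypothetical; real platforms will apply their own rules.

```python
import re
from collections import Counter


def clean_numeric(raw):
    """Strip commas, whitespace and dollar signs, then parse as float.

    Returns None when the value still cannot be parsed.
    """
    stripped = re.sub(r"[,\s$]", "", raw)
    try:
        return float(stripped)
    except ValueError:
        return None


def standardize_labels(values):
    """Collapse labels that differ only by case or surrounding whitespace.

    Returns the cleaned list plus the mapping that was applied, so a human
    can inspect the assumptions the automation made.
    """
    groups = {}
    for v in values:
        groups.setdefault(v.strip().lower(), Counter())[v] += 1
    mapping = {}
    for counter in groups.values():
        # Use the most frequent spelling in each group as the canonical label.
        canonical = counter.most_common(1)[0][0].strip()
        for variant in counter:
            mapping[variant] = canonical
    return [mapping[v] for v in values], mapping


amounts = [clean_numeric(v) for v in ["1,234.50", "$ 12", "n/a"]]
tiers, applied = standardize_labels(["Gold", "gold ", "GOLD", "Gold", "Silver", "Silver", "silver"])

print(amounts)
for original, canonical in applied.items():
    print(repr(original), "->", repr(canonical))
```

Returning the mapping alongside the cleaned values is the design choice that matters here: it is the equivalent of checking what spell check changed before you hit send.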

So long as we are talking about things like removing skewness or normalizing data as specific algorithms require (e.g. normalizing for neural nets), once again I think I would be likely to accept an automated system.
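For readers who have not seen these transforms, a small sketch follows.  It assumes a log transform as the skewness fix and min-max scaling as the normalization; those are common choices, not necessarily what any given platform applies, and the income figures are invented.

```python
import math


def skewness(xs):
    """Population skewness: third central moment over variance to the power 1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


def min_max(xs):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


# Hypothetical incomes (in thousands) with one very large value creating a long right tail.
incomes = [20, 22, 25, 27, 30, 32, 35, 40, 60, 250]

logged = [math.log1p(x) for x in incomes]  # log(1 + x) compresses the long tail
scaled = min_max(logged)                   # e.g. as input to a neural net

print(f"skewness before: {skewness(incomes):.2f}")
print(f"skewness after log transform: {skewness(logged):.2f}")
print(f"scaled range: {min(scaled):.1f} to {max(scaled):.1f}")
```

These steps are mechanical and easy to verify, which is why handing them to an automated system seems like a reasonable trade.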

The One-Click folks are smart enough to impute missing data only where it’s needed by the algorithm (e.g. not for trees).  Beyond that there are a variety of techniques at work that vary from platform to platform that a reviewer would want to pay attention to.

Techniques range from simple median value replacement to AI-based logic.  You would also want to examine carefully how much missing data triggers a rule to simply ignore the feature.
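The simplest end of that range can be sketched in a few lines.  The function name `impute_or_drop`, the sample columns, and the 40% cutoff are assumptions for illustration only; the threshold a real platform uses is exactly the kind of setting a reviewer should ask about.

```python
from statistics import median


def impute_or_drop(columns, max_missing=0.4):
    """Median-impute each numeric column, or drop it if too much is missing.

    columns: dict mapping column name to a list of values, with None for missing.
    max_missing: hypothetical cutoff; columns with a larger missing share are ignored.
    Returns (imputed_columns, dropped_column_names).
    """
    imputed = {}
    dropped = []
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        missing_share = 1 - len(present) / len(values)
        if not present or missing_share > max_missing:
            dropped.append(name)
            continue
        fill = median(present)
        imputed[name] = [fill if v is None else v for v in values]
    return imputed, dropped


data = {
    "age": [34, None, 29, 41, None, 38],                 # one third missing: impute
    "fax_count": [None, None, None, None, 5, None],      # mostly missing: drop
}
imputed, dropped = impute_or_drop(data)

print(imputed)
print("dropped:", dropped)
```

Note how much rides on the single `max_missing` parameter: move it and a feature silently appears in or disappears from the model.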


