Automated Predictive Analytics – What Could Possibly Go Wrong?

Summary:  Will automated predictive analytics be a boon to professional data scientists, or a dangerous diversion that lets well-meaning, motivated but amateur users try to implement predictive analytics?  More on the conversation started last week about new One-Click Data-In Model-Out platforms.

I have always been very much of the view that data science is best left to data scientists.  But in my article last week, “Data Scientists Automated and Unemployed by 2025!”, I detailed what my research had turned up: a new trend and a group of analytic platform developers who are driven to deliver One-Click Data-In Model-Out functionality.  In other words, fully Automated Predictive Analytics.

In that article we used the popular definition:  Automated Predictive Analytics are services that allow a data owner to upload data and rapidly build predictive or descriptive models with a

There is an element here of trying to make the data scientist more efficient by automating the tasks that are least creative.  Much of that is in data cleansing, normalizing, removing skewness, transforming data for specific algorithm requirements, and even running multiple algorithms in parallel to determine champion models.

But the reality is that these companies are pitching directly to non-data scientists with tag lines like “Data Science Isn’t Rocket Science”.  And the reason they are headed in this direction is that Gartner predicts that this ‘Citizen Data Scientist’ market will grow 5X more quickly than the true data scientist market.

It’s also a compelling motive that, while most data scientists have settled on the one or two advanced analytic platforms or tools they prefer, this new and expanding market does not come with that baggage.  Selling here will not require displacing SAS or SPSS or any of the other heavy hitters, since those tools don’t meet the needs of these new users.

What Could Possibly Go Wrong

With these concerns in mind I set out to catalogue all the things that can go wrong when excited amateurs use sophisticated tools to produce predictive analytics.  I also went back and talked in greater depth to a few of the five One-Click companies I listed in the earlier article to see if they acknowledged or were addressing these concerns. The list of the five companies is not intended to be exhaustive and neither were my repeat interviews.

A fast way to sub-optimize a model is to start out assuming the data is homogeneous.  Data Science 101 says explore the data visually (to avoid the rookie mistakes illustrated by Anscombe’s Quartet) and lead with clustering and segmentation so you don’t mistake whole new markets for simple outliers.
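To make the Anscombe's Quartet point concrete, here is a minimal sketch using only the Python standard library. The data are the four published Anscombe (1973) datasets, not anything taken from the platforms discussed here, and the helper names (pearson_r, summarize) are my own for illustration.

```python
from math import sqrt

# Anscombe's quartet (published 1973 values); not data from this article.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def mean(values):
    return sum(values) / len(values)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def summarize(quartet):
    """Return {name: (mean of y, correlation of x and y)}, rounded to 2 places."""
    return {
        name: (round(mean(ys), 2), round(pearson_r(xs, ys), 2))
        for name, (xs, ys) in quartet.items()
    }


SUMMARY = summarize(QUARTET)
for name, (mean_y, r) in SUMMARY.items():
    print(f"dataset {name}: mean(y)={mean_y}, r={r}")
```

The four datasets print nearly identical means and correlations, yet a scatter plot of each looks completely different. That is the reason to look at the data before trusting any summary a tool hands back.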

Some of the One-Clicks have recognized this.  DataRPM leads with segmentation and clustering.  PurePredictive, which just launched, says this is on their development roadmap.

If we limit data cleansing to ensuring that all the data in a column is consistently numeric (no extraneous symbols such as commas) or categorical (ensuring a limited and consistent use of alphas), we may actually be better off with an automated system that is less likely to make human errors.  As with spell check on your phone, however, you’ll want to inspect the assumptions, especially when the system automatically tries to standardize alphas.
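As a rough illustration of this kind of cleansing, and of why the result still needs a human look, here is a minimal sketch. The function names, the characters stripped, and the sample values are all hypothetical; no One-Click platform is claimed to work this way.

```python
import re


def clean_numeric(raw_values):
    """Strip commas, dollar signs and whitespace, then parse as float.

    Returns (parsed, rejected) so that anything that could not be parsed
    is surfaced for inspection rather than silently discarded.
    """
    parsed, rejected = [], []
    for raw in raw_values:
        text = re.sub(r"[,$\s]", "", raw)
        try:
            parsed.append(float(text))
        except ValueError:
            rejected.append(raw)
    return parsed, rejected


def standardize_labels(raw_labels):
    """Map each raw label to a trimmed, lower-case form with single spaces.

    The mapping itself is returned so a reviewer can check what was merged.
    """
    return {raw: " ".join(raw.split()).lower() for raw in raw_labels}


# Hypothetical sample column values.
values, rejected = clean_numeric(["1,200", "$35.50", " 42 ", "n/a"])
mapping = standardize_labels(["NY", "ny ", "New  York"])

print(values)    # parsed numbers
print(rejected)  # entries a human should look at
print(mapping)   # how each label was standardized
```

Note that the label mapping merges "NY" and "ny " but leaves "New  York" as a separate category. That is exactly the sort of assumption worth inspecting before the column goes into a model.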

So long as we are talking about things like removing skewness or normalizing data as required by specific algorithms (e.g. normalizing for neural nets), once again I think I would be likely to accept an automated system.
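For readers who have not seen these transformations, here is a minimal sketch. It assumes a log transform to reduce right skew and min-max scaling to normalize; both are common choices I have picked for illustration, not a description of what any particular platform does, and the income figures are made up.

```python
from math import log1p


def skewness(values):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def min_max(values):
    """Rescale values linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical right-skewed feature (e.g. income in thousands).
incomes = [20, 22, 25, 27, 30, 35, 40, 60, 120, 400]

logged = [log1p(v) for v in incomes]  # log(1 + v) compresses the long tail
scaled = min_max(logged)              # values now lie between 0 and 1

print(f"skewness before: {skewness(incomes):.2f}")
print(f"skewness after:  {skewness(logged):.2f}")
```

The skewness drops after the log transform but does not vanish, which is a reminder that even a routine automated step is worth a quick check.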

The One-Click folks are smart enough to impute missing data only where it’s needed by the algorithm (e.g. not for trees).  Beyond that, the techniques at work vary from platform to platform, and a reviewer would want to pay attention to them.

Techniques range from simple median value replacement to AI-based logic.  You would also want to examine carefully how much missing data triggers a rule to simply ignore the feature.
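The simplest end of that range can be sketched as follows. The 40% cutoff, the function name impute_or_drop, and the sample data are assumptions for illustration only; each platform will have its own rule, which is precisely what a reviewer should ask about.

```python
from statistics import median


def impute_or_drop(columns, max_missing_fraction=0.4):
    """Median-impute each column unless too much of it is missing.

    columns maps a feature name to a list where None marks a missing value.
    Features whose missing fraction exceeds max_missing_fraction (an assumed
    threshold) are dropped instead of imputed.
    Returns (kept_columns, dropped_feature_names).
    """
    kept, dropped = {}, []
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        missing_fraction = 1 - len(present) / len(values)
        if missing_fraction > max_missing_fraction:
            dropped.append(name)
            continue
        fill = median(present)
        kept[name] = [fill if v is None else v for v in values]
    return kept, dropped


# Hypothetical data: 'age' is 20% missing, 'income' is 60% missing.
data = {
    "age": [34, None, 51, 29, 40],
    "income": [None, None, None, 52000, 61000],
}
kept, dropped = impute_or_drop(data)

print(kept)     # 'age' with its gap filled by the median
print(dropped)  # features ignored because too much was missing
```

Changing max_missing_fraction changes which features survive, so the threshold is a modeling decision and not just a housekeeping detail.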


