Summary: Will Automated Predictive Analytics be a boon to professional data scientists, or a dangerous diversion that lets well-meaning, motivated, but amateur users try their hand at predictive analytics? More on the conversation started last week about the new One-Click Data-In Model-Out platforms.
I have always been very much of the view that data science is best left to data scientists. But in the research that led up to my article last week “Data Scientists Automated and Unemployed by 2025!” I detailed what I’d found about a new trend and a group of analytic platform developers who are driven to deliver One-Click Data-In Model-Out functionality. In other words, fully Automated Predictive Analytics.
In that article we used the popular definition: Automated Predictive Analytics are services that allow a data owner to upload data and rapidly build predictive or descriptive models with a
There is an element here of trying to make the data scientist more efficient by automating the tasks that are least creative. Much of that is in data cleansing, normalizing, removing skewness, transforming data for specific algorithm requirements, and even running multiple algorithms in parallel to determine champion models.
But the reality is that these companies are pitching directly to non-data scientists with tag lines like “Data Science Isn’t Rocket Science”. And the reason they are headed in this direction is that Gartner predicts that this ‘Citizen Data Scientist’ market will grow 5X more quickly than the true data scientist market.
There is another compelling motive. Most data scientists have settled on the one or two advanced analytic platforms or tools they prefer, but this new and expanding market does not come with that baggage. Selling here will not require displacing SAS or SPSS or any of the other heavy hitters, since those tools don’t meet the needs of these new users.
What Could Possibly Go Wrong
With these concerns in mind I set out to catalogue all the things that can go wrong when excited amateurs use sophisticated tools to produce predictive analytics. I also went back and talked in greater depth to a few of the five One-Click companies I listed in the earlier article to see if they acknowledged or were addressing these concerns. The list of the five companies is not intended to be exhaustive and neither were my repeat interviews.
A fast way to sub-optimize a model is to start out assuming the data is homogeneous. Data Science 101 says explore the data visually (to avoid the rookie mistakes illustrated by Anscombe’s Quartet) and lead with clustering and segmentation so you don’t mistake whole new markets for simple outliers.
Some of the One-Clicks have recognized this. DataRPM leads with segmentation and clustering. PurePredictive, which just launched, says this is on their development roadmap.
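To make the Anscombe’s Quartet point concrete, here is a minimal Python sketch using only the standard library. The data values are Anscombe’s published 1973 figures; the helper name `pearson` and the dictionary layout are my own choices for illustration. It shows that four data sets that look completely different when plotted share nearly identical means and correlations, which is exactly why summary statistics alone can mislead.

```python
from statistics import mean

# Anscombe's Quartet: four small x/y data sets (Anscombe, 1973).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
X4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (X4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Mean of x, mean of y, and correlation for each of the four data sets.
summary = {
    name: (mean(xs), mean(ys), pearson(xs, ys))
    for name, (xs, ys) in QUARTET.items()
}

for name, (mx, my, r) in summary.items():
    print(f"{name:>3}: mean x = {mx:.2f}, mean y = {my:.2f}, r = {r:.2f}")
```

The printed rows are practically indistinguishable, yet a scatter plot of each set reveals a line, a curve, a line with one outlier, and a vertical cluster with one extreme point.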
If we limit data cleansing to ensuring that all the data in a column is consistently numeric (no extraneous symbols such as commas) or categorical (a limited and consistent use of alphas), we may actually be better off with an automated system that is less likely to make human errors. Like spell check on your phone, however, you’ll want to inspect its assumptions, especially when it automatically tries to standardize alphas.
So long as we are talking about things like removing skewness or normalizing data for specific algorithms (e.g. normalizing for neural nets), I would once again be likely to accept an automated system.
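As a rough illustration of this kind of cleansing, the sketch below strips extraneous symbols from a numeric column and collapses spelling variants of a categorical label. It is not how any of the One-Click platforms do it; the function names `clean_numeric` and `standardize_labels` and the sample values are hypothetical. Note that the label function returns the mapping it applied, so a human can review the automated assumptions.

```python
import re
from collections import Counter


def clean_numeric(raw):
    """Strip commas, whitespace and common currency symbols, then parse.

    Returns a float, or None when the value still cannot be parsed,
    so that a person can inspect it instead of trusting a silent guess.
    """
    text = re.sub(r"[,\s$€£]", "", str(raw))
    try:
        return float(text)
    except ValueError:
        return None


def standardize_labels(values):
    """Collapse case and whitespace variants onto the most common spelling.

    Returns the cleaned list plus the mapping that was applied.
    """
    def key(value):
        return " ".join(value.split()).casefold()

    counts = Counter(values)
    canonical = {}
    # Most frequent spelling wins; ties go to the spelling seen first.
    for value, _ in counts.most_common():
        canonical.setdefault(key(value), " ".join(value.split()))
    mapping = {value: canonical[key(value)] for value in counts}
    return [mapping[value] for value in values], mapping


amounts = [clean_numeric(v) for v in ["1,250.50", "$3,000", " 42 ", "n/a"]]
cities, city_mapping = standardize_labels(
    ["New York", "new york ", "NEW YORK", "New York", "Boston", "boston"]
)
print(amounts)
print(city_mapping)
```

Printing the mapping is the "inspect the assumptions" step: a rule this simple will happily merge labels that only look alike.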
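For readers who have not seen these transformations, here is a small sketch of two common ones: a log transform to reduce right skew, followed by min-max scaling to the 0 to 1 range that neural nets often prefer. These are textbook techniques that I am swapping in for illustration, not a description of any vendor’s pipeline, and the sample figures are invented.

```python
import math
from statistics import mean, pstdev


def skewness(xs):
    """Population skewness: the mean of the cubed z-scores."""
    m, s = mean(xs), pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)


def log_transform(xs):
    """log(1 + x) compresses a long right tail; requires x > -1."""
    return [math.log1p(x) for x in xs]


def min_max_scale(xs):
    """Rescale values linearly onto the interval [0, 1]."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0 for _ in xs]  # constant column: nothing to scale
    return [(x - lo) / (hi - lo) for x in xs]


# Invented, right-skewed sample (e.g. incomes in thousands).
incomes = [20, 22, 25, 27, 30, 35, 40, 60, 120, 400]
logged = log_transform(incomes)
scaled = min_max_scale(logged)

print(f"skewness before: {skewness(incomes):.2f}")
print(f"skewness after:  {skewness(logged):.2f}")
```

The order matters here: scaling is linear and leaves skewness unchanged, so the skew has to be addressed before the values are normalized.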
The One-Click folks are smart enough to impute missing data only where it’s needed by the algorithm (e.g. not for trees). Beyond that there are a variety of techniques at work that vary from platform to platform that a reviewer would want to pay attention to.
Techniques range from simple median value replacement to AI-based logic. You would also want to examine carefully how much missing data triggers a rule to simply ignore the feature.
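The simplest end of that range can be sketched in a few lines. The example below fills gaps with the column median and drops any feature whose share of missing values exceeds a threshold. The function name `impute_or_drop`, the 40% default threshold, and the sample data are all assumptions made for illustration; each platform sets its own rule, which is precisely what a reviewer should check.

```python
from statistics import median


def impute_or_drop(columns, max_missing_share=0.4):
    """Median-impute each numeric column, or drop it if too sparse.

    `columns` maps a feature name to a list of values, with None
    marking a missing entry. Returns (kept_columns, dropped_names).
    The 0.4 threshold is an arbitrary illustrative default.
    """
    kept, dropped = {}, []
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        if not present:
            dropped.append(name)  # nothing to compute a median from
            continue
        missing_share = 1 - len(present) / len(values)
        if missing_share > max_missing_share:
            dropped.append(name)  # too sparse: ignore the feature
            continue
        fill = median(present)
        kept[name] = [fill if v is None else v for v in values]
    return kept, dropped


data = {
    "age": [34, None, 29, 41, 38],
    "income": [None, None, None, 52000, None],
}
kept, dropped = impute_or_drop(data)
print(kept)
print(dropped)
```

Moving the threshold changes which features survive into the model, so it deserves the same scrutiny as the imputation method itself.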