Data science sits at the core of any analytical exercise conducted in a big data or Internet of Things (IoT) environment. It draws on a wide array of technologies, business knowledge, and machine-learning algorithms. Its purpose is not only to perform machine learning or statistical analysis, but also to derive insights from the data that a user with no statistics knowledge can understand.
In a fast-paced environment such as big data and IoT, where the type of data may vary over time, it becomes difficult to maintain and re-create models every time the data changes. This gap calls for an automated way to manage data-science algorithms in those environments. The rise of data science was intended to move us away from rules-based systems toward systems in which a machine learns the rules for its automation by itself. Machine learning therefore makes data science inherently partially automated. The part of data science that still requires manual intervention remains to be automated. However, those are the areas that depend on the experience and wisdom of people: a data scientist, a business expert, a software developer, a data integrator, and everyone else who currently contributes to making a data-science project operational. This makes it difficult to automate every aspect of data science. We can, however, think of data-science automation as a two-level architecture, wherein:
– Each individual data-science component is automated
– All the individual automated components are interconnected to form a coherent data-science system
We can think of a data-science system as automated when it is capable of solving our problem whenever we give it a data set. It should also be intelligent enough to present all possible solutions in a language that we can understand.
Data preparation, machine learning, domain knowledge, and result interpretation are four major tasks required to execute a data-science project successfully. All these tasks have to be converted to automated modules to create an automated data-science system (Figure 1).
Data preparation is a repetitive task that has to be done every time a model is created. Data extraction, data cleaning, and data transformations, such as imputing null values and applying algorithm-specific transformations, all fall into this category. Many organizations automate these tasks and brand the resulting engine as a data-science automation tool. However, most of these tools use rule-based logic to automate data preprocessing. Is this the right approach? Do we need rule-based systems to automate data science, which was born to end rule-based systems? No. We need data preprocessing that is automated by machine learning itself. For example, the decision about which preprocessing function to apply to the data for a given problem should be made by the machine itself.
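To make the contrast with a hard-coded rule concrete, here is a minimal sketch in which the data itself decides how null values are imputed. It is an illustration only, not a description of any existing tool: the names IMPUTERS, choose_imputer, and impute are hypothetical, only two candidate strategies are considered, and a simple hold-out check stands in for a learned policy.

```python
import random
import statistics

# Hypothetical candidate imputers: each maps the observed values to a fill value.
IMPUTERS = {
    "mean": statistics.mean,
    "median": statistics.median,
}

def choose_imputer(column, trials=200, seed=0):
    """Pick the imputer that best reconstructs hidden observed values.

    Assumes the column has at least two non-missing numeric entries.
    """
    rng = random.Random(seed)
    observed = [v for v in column if v is not None]
    errors = {name: 0.0 for name in IMPUTERS}
    for _ in range(trials):
        # Hide one observed value and ask each candidate to reconstruct it.
        i = rng.randrange(len(observed))
        rest = observed[:i] + observed[i + 1:]
        for name, fn in IMPUTERS.items():
            errors[name] += abs(fn(rest) - observed[i])
    # The strategy with the lowest accumulated absolute error wins.
    return min(errors, key=errors.get)

def impute(column):
    """Fill missing entries (None) using the strategy selected from the data."""
    name = choose_imputer(column)
    observed = [v for v in column if v is not None]
    fill = IMPUTERS[name](observed)
    return name, [fill if v is None else v for v in column]

# A skewed column with one extreme value and one missing entry.
strategy, filled = impute([1, 2, 2, 3, 3, 3, 4, None, 1000])
```

On a skewed column like this one, the hold-out check favors the median because the extreme value distorts the mean. No rule saying "use the median for skewed data" was written by hand; the choice falls out of the evaluation.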
Feature engineering is another area of data preparation that requires automation. Feature engineering is the technique of converting raw data into attributes, or predictors, that improve the accuracy of a machine-learning model. Feature-engineering automation is still at a nascent stage and is an active area of research. Data scientists from MIT are making significant progress toward developing a “deep feature synthesis” algorithm capable of generating features from raw data.
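As a loose illustration of what generating features from raw data can look like, the sketch below aggregates raw transaction rows into per-customer predictors. It is not the deep feature synthesis algorithm itself. The PRIMITIVES table, the synthesize_features function, and the sample data are all assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean

# Aggregation primitives applied to each numeric column; names are illustrative.
PRIMITIVES = {"sum": sum, "mean": mean, "max": max, "count": len}

def synthesize_features(child_rows, key, numeric_columns):
    """Build one feature row per parent key by aggregating its child rows."""
    grouped = defaultdict(list)
    for row in child_rows:
        grouped[row[key]].append(row)

    features = {}
    for parent_id, rows in grouped.items():
        feats = {}
        for col in numeric_columns:
            values = [r[col] for r in rows]
            # Every primitive yields one candidate feature per column.
            for name, fn in PRIMITIVES.items():
                feats[f"{name}({col})"] = fn(values)
        features[parent_id] = feats
    return features

# Made-up raw data: one row per transaction.
transactions = [
    {"customer_id": "a", "amount": 10.0},
    {"customer_id": "a", "amount": 30.0},
    {"customer_id": "b", "amount": 5.0},
]
features = synthesize_features(transactions, "customer_id", ["amount"])
```

Each customer ends up with candidate predictors such as sum(amount) and mean(amount) without anyone writing them by hand. Adding a primitive to the table extends every column's feature set at once.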
Machine learning is the area of data-science automation in which the statistical routines themselves are automated. The system selects and executes the best algorithm for the provided data set. It hides the intricacies and mathematical complexity of the algorithms from the user, making them available to the masses. The user needs only to provide this automated statistician with data.
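The idea of a system that picks the best algorithm for a given data set can be sketched in a few lines. The example below is a deliberately small stand-in for an automated statistician, under stated assumptions: there are only two candidate models (a mean baseline and a one-variable least-squares line), the score is the leave-one-out squared error, and the names MODELS, loo_error, and auto_select are hypothetical.

```python
from statistics import mean

def fit_mean(xs, ys):
    """Baseline model: always predict the training mean."""
    m = mean(ys)
    return lambda x: m

def fit_linear(xs, ys):
    """Ordinary least squares for a single predictor."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)

# Hypothetical candidate pool; a real system would search far more models.
MODELS = {"mean": fit_mean, "linear": fit_linear}

def loo_error(fit, xs, ys):
    """Leave-one-out mean squared error of a fitting routine."""
    total = 0.0
    for i in range(len(xs)):
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total += (model(xs[i]) - ys[i]) ** 2
    return total / len(xs)

def auto_select(xs, ys):
    """Score every candidate, then refit the winner on all the data."""
    scores = {name: loo_error(fit, xs, ys) for name, fit in MODELS.items()}
    best = min(scores, key=scores.get)
    return best, MODELS[best](xs, ys)

# The user supplies only data; the routine decides which model to use.
best_name, model = auto_select([1.0, 2.0, 3.0, 4.0, 5.0],
                               [2.0, 4.0, 6.0, 8.0, 10.0])
```

The caller never names an algorithm. Supplying the data is enough, and the selection logic stays hidden behind auto_select, which is the property the automated statistician is meant to provide.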