TPOT: A Python tool for automating data science
Machine learning is often touted as:
A field of study that gives computers the ability to learn without being explicitly programmed.
Despite this common claim, anyone who has worked in the field knows that designing effective machine learning systems is a tedious endeavor, and typically requires considerable experience with machine learning algorithms, expert knowledge of the problem domain, and brute force search to accomplish. Thus, contrary to what machine learning enthusiasts would have us believe, machine learning still requires a considerable amount of explicit programming.
In this article, we’re going to go over three aspects of machine learning pipeline design that tend to be tedious but nonetheless important. After that, we’re going to step through a demo for a tool that intelligently automates the process of machine learning pipeline design, so we can spend our time working on the more interesting aspects of data science.
Let’s get started.
Model hyperparameter tuning is important
One of the most tedious parts of machine learning is model hyperparameter tuning.
Support vector machines require us to select the ideal kernel, the kernel’s parameters, and the penalty parameter C. Artificial neural networks require us to tune the number of hidden layers, number of hidden nodes, and many more hyperparameters. Even random forests require us to tune the number of trees in the ensemble at a minimum.
All of these hyperparameters can have significant impacts on how well the model performs. For example, on the MNIST handwritten digit data set:
If we fit a random forest classifier with only 10 trees (scikit-learn’s default):
import pandas as pd import numpy as np from sklearn.ensemble import RandomForestClassifier from sklearn.cross_validation import cross_val_score mnist_data = pd.read_csv('https://raw.githubusercontent.com/rhiever/Data-Analysis-and-Machine-Learning-Projects/master/tpot-demo/mnist.csv.gz', sep='t', compression='gzip') cv_scores = cross_val_score(RandomForestClassifier(n_estimators=10, n_jobs=-1), X=mnist_data.drop('class', axis=1).values, y=mnist_data.loc[:, 'class'].values, cv=10) print(cv_scores) [ 0.93461813 0.96287836 0.94688749 0.94072275 0.95114286 0.94570653 0.94884253 0.94311848 0.93825043 0.95668954] print(np.mean(cv_scores)) 0.946885709001
The random forest achieves an average of 94.7% cross-validation accuracy on MNIST. However, what if we tuned that hyperparameter a little bit and provided the random forest with 100 trees instead?
cv_scores = cross_val_score(RandomForestClassifier(n_estimators=100, n_jobs=-1), X=mnist_data.drop('class', axis=1).values, y=mnist_data.loc[:, 'class'].values, cv=10) print(cv_scores) [ 0.96259814 0.97829812 0.9684466 0.96700471 0.966 0.96399486 0.97113461 0.96755752 0.96397942 0.97684391] print(np.mean(cv_scores)) 0.968585789367
With such a minor change, we improved the random forest’s average cross-validation accuracy from 94.7% to 96.9%. This small improvement in accuracy can translate into millions of additional digits classified correctly if we’re applying this model on the scale of, say, processing addresses for the U.S. Postal Service.
Never use the defaults for your model. Hyperparameter tuning is vitally important for every machine learning project.
Model selection is important
We all love to think that our favorite model will perform well on every machine learning problem, but different models are better suited for different tasks.
For example, if we’re working on a signal processing problem where we need to classify whether there’s a “hill” or “valley” in the time series:
And we apply a “tuned” random forest to the problem:
import pandas as pd import numpy as np from sklearn.