TPOT: A Python tool for automating data science
Machine learning is often touted as:
A field of study that gives computers the ability to learn without being explicitly programmed.
Despite this common claim, anyone who has worked in the field knows that designing effective machine learning systems is a tedious endeavor, and typically requires considerable experience with machine learning algorithms, expert knowledge of the problem domain, and brute force search to accomplish. Thus, contrary to what machine learning enthusiasts would have us believe, machine learning still requires a considerable amount of explicit programming.
In this article, we're going to go over three aspects of machine learning pipeline design that tend to be tedious but nonetheless important. After that, we'll step through a demo of a tool that intelligently automates the process of machine learning pipeline design, so we can spend our time on the more interesting aspects of data science.
Let’s get started.
Model hyperparameter tuning is important
One of the most tedious parts of machine learning is model hyperparameter tuning.
Support vector machines require us to select the ideal kernel, the kernel’s parameters, and the penalty parameter C. Artificial neural networks require us to tune the number of hidden layers, number of hidden nodes, and many more hyperparameters. Even random forests require us to tune the number of trees in the ensemble at a minimum.
All of these hyperparameters can have significant impacts on how well the model performs. For example, on the MNIST handwritten digit data set:
If we fit a random forest classifier with only 10 trees (scikit-learn’s default):
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

mnist_data = pd.read_csv('https://raw.githubusercontent.com/rhiever/Data-Analysis-and-Machine-Learning-Projects/master/tpot-demo/mnist.csv.gz',
                         sep='\t', compression='gzip')

cv_scores = cross_val_score(RandomForestClassifier(n_estimators=10, n_jobs=-1),
                            X=mnist_data.drop('class', axis=1).values,
                            y=mnist_data.loc[:, 'class'].values,
                            cv=10)

print(cv_scores)
# [ 0.93461813  0.96287836  0.94688749  0.94072275  0.95114286  0.94570653
#   0.94884253  0.94311848  0.93825043  0.95668954]

print(np.mean(cv_scores))
# 0.946885709001
The random forest achieves an average of 94.7% cross-validation accuracy on MNIST. However, what if we tuned that hyperparameter a little bit and provided the random forest with 100 trees instead?
cv_scores = cross_val_score(RandomForestClassifier(n_estimators=100, n_jobs=-1),
                            X=mnist_data.drop('class', axis=1).values,
                            y=mnist_data.loc[:, 'class'].values,
                            cv=10)

print(cv_scores)
# [ 0.96259814  0.97829812  0.9684466   0.96700471  0.966       0.96399486
#   0.97113461  0.96755752  0.96397942  0.97684391]

print(np.mean(cv_scores))
# 0.968585789367
With such a minor change, we improved the random forest’s average cross-validation accuracy from 94.7% to 96.9%. This small improvement in accuracy can translate into millions of additional digits classified correctly if we’re applying this model on the scale of, say, processing addresses for the U.S. Postal Service.
Never use the defaults for your model. Hyperparameter tuning is vitally important for every machine learning project.
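In practice, a search over a small grid of candidate values is often enough to beat the defaults. Here is a minimal sketch using scikit-learn's GridSearchCV on the mnist_data frame loaded above; the particular grid values are illustrative assumptions, not tuned recommendations:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative candidate values -- assumptions for this sketch,
# not results from the demo above
param_grid = {
    'n_estimators': [10, 100, 500],
    'max_features': ['sqrt', 'log2', None],
}

grid_search = GridSearchCV(RandomForestClassifier(n_jobs=-1),
                           param_grid=param_grid,
                           cv=10)
grid_search.fit(mnist_data.drop('class', axis=1).values,
                mnist_data.loc[:, 'class'].values)

# Best hyperparameter combination found and its cross-validation score
print(grid_search.best_params_)
print(grid_search.best_score_)

Note that grid search multiplies training cost by the number of grid points, which is exactly the sort of tedium a tool like TPOT is meant to take off our hands.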
Model selection is important
We all love to think that our favorite model will perform well on every machine learning problem, but different models are better suited for different tasks.
For example, say we're working on a signal processing problem where we need to classify whether a time series contains a "hill" or a "valley", and we apply a "tuned" random forest to the problem:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
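A sketch of how that experiment proceeds, mirroring the MNIST code above; the hill/valley file name and location are assumptions patterned on the MNIST file in the same repository:

# Assumed dataset path, patterned on the MNIST file above
hill_valley_data = pd.read_csv('https://raw.githubusercontent.com/rhiever/Data-Analysis-and-Machine-Learning-Projects/master/tpot-demo/hill-valley.csv.gz',
                               sep='\t', compression='gzip')

cv_scores = cross_val_score(RandomForestClassifier(n_estimators=100, n_jobs=-1),
                            X=hill_valley_data.drop('class', axis=1).values,
                            y=hill_valley_data.loc[:, 'class'].values,
                            cv=10)

print(np.mean(cv_scores))

The point of the exercise is that even a tuned random forest can fare poorly on a problem better suited to another model, which is why model selection deserves the same care as hyperparameter tuning.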

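The tool promised at the outset is TPOT, which automates pipeline design by searching over preprocessors, models, and their hyperparameters. A minimal sketch of a typical TPOT run, assuming a recent TPOT release; the generations, population size, and train/test split here are illustrative:

from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split

# Reusing the MNIST frame loaded above; the 75/25 split is illustrative
X_train, X_test, y_train, y_test = train_test_split(
    mnist_data.drop('class', axis=1).values,
    mnist_data.loc[:, 'class'].values,
    train_size=0.75, test_size=0.25)

# TPOT uses genetic programming to search for a high-performing pipeline
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))

# Export the best pipeline found as standalone scikit-learn code
tpot.export('tpot_mnist_pipeline.py')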