TPOT: A Python tool for automating data science

Machine learning is often touted as:
A field of study that gives computers the ability to learn without being explicitly programmed.
Despite this common claim, anyone who has worked in the field knows that designing effective machine learning systems is a tedious endeavor, and typically requires considerable experience with machine learning algorithms, expert knowledge of the problem domain, and brute-force search. Thus, contrary to what machine learning enthusiasts would have us believe, machine learning still requires a considerable amount of explicit programming.
In this article, we’re going to go over three aspects of machine learning pipeline design that tend to be tedious but nonetheless important. After that, we’re going to step through a demo for a tool that intelligently automates the process of machine learning pipeline design, so we can spend our time working on the more interesting aspects of data science.
Let’s get started.
Model hyperparameter tuning is important
One of the most tedious parts of machine learning is model hyperparameter tuning.
Support vector machines require us to select the ideal kernel, the kernel’s parameters, and the penalty parameter C. Artificial neural networks require us to tune the number of hidden layers, number of hidden nodes, and many more hyperparameters. Even random forests require us to tune the number of trees in the ensemble at a minimum.
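To make that search concrete, here's a minimal sketch of tuning the SVM hyperparameters listed above with an exhaustive grid search. The candidate values are arbitrary choices for illustration, and note that in scikit-learn 0.18+ GridSearchCV moved from sklearn.grid_search to sklearn.model_selection:

from sklearn.datasets import load_digits
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer releases
from sklearn.svm import SVC

digits = load_digits()

# Candidate values are illustrative, not a recommended grid
param_grid = {'kernel': ['linear', 'rbf'],
              'C': [0.1, 1., 10., 100.],
              'gamma': [0.0001, 0.001, 0.01]}

# Exhaustively evaluate every combination with 5-fold cross-validation
search = GridSearchCV(SVC(), param_grid=param_grid, cv=5)
search.fit(digits.data, digits.target)

print(search.best_params_)
print(search.best_score_)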
All of these hyperparameters can have significant impacts on how well the model performs. For example, on the MNIST handwritten digit data set:
If we fit a random forest classifier with only 10 trees (scikit-learn’s default):
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import cross_val_score

# Load the MNIST data set (tab-separated, gzip-compressed)
mnist_data = pd.read_csv('https://raw.githubusercontent.com/rhiever/Data-Analysis-and-Machine-Learning-Projects/master/tpot-demo/mnist.csv.gz', sep='\t', compression='gzip')

# 10-fold cross-validation of a 10-tree random forest
cv_scores = cross_val_score(RandomForestClassifier(n_estimators=10, n_jobs=-1),
                            X=mnist_data.drop('class', axis=1).values,
                            y=mnist_data.loc[:, 'class'].values,
                            cv=10)

print(cv_scores)
[ 0.93461813  0.96287836  0.94688749  0.94072275  0.95114286  0.94570653
  0.94884253  0.94311848  0.93825043  0.95668954]

print(np.mean(cv_scores))
0.946885709001
The random forest achieves an average of 94.7% cross-validation accuracy on MNIST. However, what if we tuned that hyperparameter a little bit and provided the random forest with 100 trees instead?
cv_scores = cross_val_score(RandomForestClassifier(n_estimators=100, n_jobs=-1),
                            X=mnist_data.drop('class', axis=1).values,
                            y=mnist_data.loc[:, 'class'].values,
                            cv=10)

print(cv_scores)
[ 0.96259814  0.97829812  0.9684466   0.96700471  0.966       0.96399486
  0.97113461  0.96755752  0.96397942  0.97684391]

print(np.mean(cv_scores))
0.968585789367
With such a minor change, we improved the random forest’s average cross-validation accuracy from 94.7% to 96.9%. This small improvement in accuracy can translate into millions of additional digits classified correctly if we’re applying this model on the scale of, say, processing addresses for the U.S. Postal Service.
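To see where the "millions" figure comes from, here's a back-of-the-envelope calculation; the daily volume is purely an assumed number for illustration:

# Assumed daily volume of handwritten digits -- a made-up figure for illustration
digits_per_day = 100000000

acc_default = 0.947  # 10 trees
acc_tuned = 0.969    # 100 trees

extra_correct_per_day = (acc_tuned - acc_default) * digits_per_day
print(extra_correct_per_day)  # roughly 2.2 million more digits classified correctly per day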
Never use the defaults for your model. Hyperparameter tuning is vitally important for every machine learning project.
Model selection is important
We all love to think that our favorite model will perform well on every machine learning problem, but different models are better suited for different tasks.
For example, suppose we’re working on a signal processing problem where we need to classify whether a time series contains a “hill” (a run of points that rises to a bump and comes back down) or a “valley” (the same shape flipped upside down). If we apply a “tuned” random forest to the problem:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import cross_val_score

# The original code was truncated here; the dataset path and column name
# are assumed to mirror the MNIST example above
hill_valley_data = pd.read_csv('https://raw.githubusercontent.com/rhiever/Data-Analysis-and-Machine-Learning-Projects/master/tpot-demo/Hill_Valley_without_noise.csv.gz', sep='\t', compression='gzip')

cv_scores = cross_val_score(RandomForestClassifier(n_estimators=100, n_jobs=-1),
                            X=hill_valley_data.drop('class', axis=1).values,
                            y=hill_valley_data.loc[:, 'class'].values,
                            cv=10)

print(np.mean(cv_scores))
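As a point of comparison, and purely as a sketch of how one would test the model-selection claim (the choice of logistic regression here is illustrative, not prescriptive), we can score a simple linear model on the same data with the same 10-fold procedure:

from sklearn.linear_model import LogisticRegression

# Same data, same cross-validation scheme, different model class
cv_scores = cross_val_score(LogisticRegression(),
                            X=hill_valley_data.drop('class', axis=1).values,
                            y=hill_valley_data.loc[:, 'class'].values,
                            cv=10)

print(np.mean(cv_scores))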
