This post looks at perhaps the most important, and most often overlooked, step in learning machine learning: the one that can make the biggest difference to your skill set.
The web is full of good explanations of machine learning algorithms. And every second applicant for a data science position has finished the Coursera course on machine learning. While it is important to understand the concepts behind the algorithms, one thing is even more important:
Theory will not help you choose good values for the 16 parameters a standard implementation of a random forest takes. The default values are good to get started, but which parameters should you modify depending on your data?
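To make that question concrete, here is a minimal sketch of what "trying parameters" can look like in practice. It assumes scikit-learn as the implementation, uses a synthetic data set in place of real data, and searches over three parameters (max_features, min_samples_leaf, max_depth) with arbitrary candidate values. None of these choices come from theory; they are only a starting point for experimenting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data stands in for a real data set (an assumption for illustration).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# A few parameters that are often tuned; the candidate values are arbitrary.
param_grid = {
    "max_features": ["sqrt", 0.5],
    "min_samples_leaf": [1, 5],
    "max_depth": [None, 5],
}

# Try every combination with 3-fold cross-validation.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid,
    cv=3,
)
search.fit(X, y)

# How many knobs this particular implementation exposes.
print("constructor parameters:", len(RandomForestClassifier().get_params()))
print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", round(search.best_score_, 3))
```

The winning combination will differ from one data set to the next, which is exactly the point: you only develop a feel for which parameters matter by running this kind of experiment on many different data sets.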
Choosing the right features, algorithms and parameters is an art. It’s actually more like karate than like math. You won’t learn it from a book. You learn it by doing, by getting your hands dirty and applying algorithms to various data sets. By lots of trial and error. By having seen hundreds of successful applications.
Take the people winning Kaggle competitions, for example. One might think they are the leading researchers in the field of machine learning. But in fact, most of them have spent a lot of time on Kaggle working with actual data. Check for yourself: Kaggle publishes profiles of top Kagglers on its blog.
To get better at applying machine learning techniques, here is the one simple step I recommend. While it is simple as a concept, it will (of course) take perseverance and many hours of work.
It is easy to get this one wrong, though. The prize money and the leaderboard could easily make you think it is about winning Kaggle competitions. The inglorious truth is that winning is a distraction. Winning a data science competition takes skills that are not relevant to a data science job: you need to build overly complex models and squeeze out the last fractions of a percentage point. Even the winning model of the Netflix Prize has not been used in practice.