With growing interest in neural networks and deep learning, individuals and companies are claiming ever-increasing adoption rates of artificial intelligence into their daily workflows and product offerings.
Coupled with breakneck speeds in AI-research, the new wave of popularity shows a lot of promise for solving some of the harder problems out there.
That said, I feel that this field suffers from a gulf between appreciating these developments and subsequently deploying them to solve "real-world" tasks.
A number of frameworks, tutorials and guides have popped up to democratize machine learning, but the steps that they prescribe often don't align with the fuzzier problems that need to be solved.
This post is a collection of questions (with some (maybe even incorrect) answers) that are worth thinking about when applying machine learning in production.
While starting out, most tutorials usually include well-defined datasets. Whether it be MNIST, the Wikipedia corpus or any of the great options from the UCI Machine Learning Repository, these datasets are often not representative of the problem that you wish to solve.
For your specific use case, an appropriate dataset might not even exist and building a dataset could take much longer than you expect.
For example, at Semantics3, we tackle a number of ecommerce-specific problems ranging from product categorization to product matching to search relevance. For each of these problems, we had to look within and spend considerable effort to generate high-fidelity product datasets.
In many cases, even if you possess the required data, significant (and expensive) manual labor might be required to categorize, annotate and label your data for training.
This is another step, often independent of the actual models, that is glossed over in most tutorials. Such omissions appear even more glaring when exploring deep neural networks, where transforming the data into usable "input" is crucial.
While there exist some standard techniques for images, like cropping, scaling, zero-centering and whitening - the final decision is still up to individuals on the level of normalization required for each task.
The field gets even messier when working with text. Is capitalization important? Should I use a tokenizer? What about word embeddings? How big should my vocabulary and dimensionality be? Should I use pre-trained vectors or start from scratch or layer them?
There is no right answer applicable across all situations, but keeping abreast of available options is often half the battle. A recent post from the creator of spaCy details an interesting strategy to standardize deep learning for text.
This might be the question with the most opinionated answers. I am including this section here only for completeness and would gladly point you to the various other resources available for making this decision.
While each person might have different criteria for evaluation, mine has simply been ease of customization, prototyping and testing. In that aspect, I prefer to start with scikit-learn where possible and use Keras for my deep learning projects.
Data Innovation Summit 2017
30% off with code 7wData
Big Data Innovation Summit London
$200 off with code DATA200
Enterprise Data World 2017
$200 off with code 7WDATA
Data Visualisation Summit San Francisco
$200 off with code DATA200
Chief Analytics Officer Europe
15% off with code 7WDCAO17