Data science is fast becoming a critical skill for developers and managers across industries, and it looks like a lot of fun as well. But it’s pretty complicated - there are a lot of engineering and analytical options to navigate, and it’s hard to know if you’re doing it right or where the bear traps lie. In this series we explore ways in to making sense of data science - understanding where it’s needed and where it’s not, and how to make it an asset for you, from people who’ve been there and done it.
Enterprises are increasingly realising that many of their most pressing business problems could be tackled with the application of a little data science. Where a few years ago you might have threatened to replace someone with a very small shell script, you could now replace a few more people with a very small predictive model.
Data science is a catch-all term for a set of interdisciplinary techniques which put data to work to extract useful insights, predictions, or knowledge - calling on elements of statistics, programming, data mining, and machine learning. It shows up in a wide variety of areas, some of them literally rocket science and others much more prosaic. Data science is behind consumer internet magic like Amazon’s book recommendations or LinkedIn’s People You May Know. It’s behind new things like self-driving cars, which use these techniques to learn how to drive safely. And it’s behind day-to-day practical applications like a supermarket loyalty scheme, such as Tesco’s Clubcard, figuring out which vouchers to send you.
The theory behind most of these applications has been around for decades. However, it’s only in the last ten years, with the advent of cheap cloud servers rented by the hour, ubiquitous data collection, distributed storage and processing, and battle-tested machine learning libraries, that applying data science in an everyday business has become a good practical choice. It’s an exciting time to solve old business problems using new data science.
However, business problems are often vaguely defined and complicated, and they come with success conditions and dependencies which mean that only certain types of model, or certain levels of precision (the fraction of positive predictions which are indeed correct) and recall (the fraction of actual positives which are found by the model), will solve them. This article walks through some of the most common challenges and the best angles of attack for the technical person trying to make that application of data the best it can be.
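To make the precision and recall definitions above concrete, here is a minimal sketch in plain Python. The function name `precision_recall` and the toy labels are illustrative assumptions, not something taken from a real project; 1 marks a positive case and 0 a negative one.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels, where 1 means positive."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
    false_pos = sum(1 for t, p in pairs if t == 0 and p == 1)
    false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Precision: of everything the model flagged as positive, how much was right?
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    # Recall: of everything that really was positive, how much did the model find?
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall


# Hypothetical toy data: 4 real positives; the model flags 3 cases, 2 correctly.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case two of the three flagged cases are correct, while only two of the four real positives are found. Which of the two numbers matters more depends on the business problem: a voucher mailing can tolerate low precision, whereas a fraud alert that wakes someone up at night probably cannot.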
One of the biggest determinants of success is choosing and defining a good problem to work on. So what does “good” mean in this area? It means:
Data science is a catch-all term, but there are some specific groupings of techniques:
There are numerous good open source data science libraries now, mostly written in Python, Java, or C++. Over the last year in particular there has been a boom in interesting deep learning tools, notably Google’s TensorFlow, for sophisticated and very large-scale machine learning such as image recognition, along with some very impressive results in the field of AI. Tempting as these cool toys are, for most applications the smart initial choice is a much simpler model, for example simple logistic regression built with scikit-learn (a popular Python library which includes tools for the most common data mining and data analysis tasks). YAGNI applies double in the world of data science.
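As a rough illustration of how little code a simple first model needs, here is a sketch of a logistic regression baseline in scikit-learn. The data is synthetic, generated with `make_classification` purely as a stand-in for a real business dataset, and the sample size, feature count, and split ratio are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for real data (assumption: 1000 rows, 10 numeric features).
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=0
)

# Hold back a quarter of the rows so the model is scored on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Plain logistic regression with default regularisation as the baseline model.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
baseline_precision = precision_score(y_test, y_pred)
baseline_recall = recall_score(y_test, y_pred)
print(f"precision={baseline_precision:.2f} recall={baseline_recall:.2f}")
```

A baseline like this gives a number to beat. If a more elaborate model cannot clearly improve on it for the metric the business problem cares about, the extra complexity is probably not paying its way.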