During my 30-year analytics career, prospective employers and clients have often asked me: “How can you help us with data-driven insights when you have not worked in this industry before?” I argue that placing greater emphasis on the data scientist’s machine learning skills, and pairing the data scientist with domain experts, is an effective pathway to bring data science to a business.
Clearly, the description of the data scientist as a mythical unicorn with computer science skills, statistical knowledge and domain expertise (Figure 1) has had an impact. The proliferation of analytics disciplines such as social network analysis, digital analytics, bioinformatics and supply chain analytics lends weight to the argument that domain expertise matters.
There are also anecdotes on the web of data science projects that went pear-shaped because the analysts were not subject matter experts. A deeper look at these anecdotes reveals that the problems stem not from a lack of domain expertise but from poor data science, such as overfitting, bad sampling methods and unnecessary data cleansing. Still, the myth that domain expertise trumps all else persists!
Data mining competitions such as those hosted on Kaggle and the KDD Cup have demonstrated the opposite, showing that data science can be successfully outsourced to people without domain expertise. Many companies have run competitions on topics as diverse as optimizing flight routes, predicting ocean health and detecting diabetic retinopathy. Data scientists with little or no expertise in the domain have responded brilliantly with useful solutions. Adam Kowalczyk and I won the KDD Cup on yeast gene regulation prediction with no background in biology. Some data scientists, such as David Vogel and Claudia Perlich, have even won across multiple domains, indicating that data science skills are transferable.
The counterargument to Kaggle’s success is that in these competitions the domain experts have already generated the hypothesis by posing the right business question and preparing the data (Figure 2), so the competitors need only model and test. But in the brave new world of massive data, with the mathematical tools and computing power to crunch the numbers, the old-world paradigm of hypothesizing before modeling is likely to be challenged. With its approach to language learning, Google has shown a whole new way of understanding the world without any a priori models or theories.