How will data science teams maintain quality standards in the face of advancing automation? Attend the IBM DataFirst Launch Event on Sep 27 in NYC and learn how to drive greater productivity from your data science teams without compromising the quality of the mission-critical business assets they produce.
Productivity in data science isn’t a matter of output in any quantitative sense. It’s more an issue of the quality of what data scientists produce.
In a data-science context, quality refers to the validity and relevance of the insights that statistical models are able to distill from the data. As I stated recently, the more data you have, the more stories data scientists can tell with it, though many of those narratives may be entirely (albeit inadvertently) fictitious.
Given the paramount importance of actionable insights, the productivity of data science teams can’t be neatly reduced to throughput or any other quantitative metric. Data scientists can easily pump up their aggregate output along myriad dimensions, such as more sources, more data, more pipeline processes, more variables, more iterations, and more visualizations. But that doesn’t necessarily get them any closer to delivering high-quality analytics for predictive, prescriptive, and other uses.
Likewise, you can’t always assume that throwing more data scientists at a problem will boost the quality of their collective output. Different data scientists may rely on different data sources; aggregate, cleanse, and sample them in different ways; incorporate different feature sets into their models; use different algorithmic and visualization approaches; employ different metrics of model fitness and predictive capability; and so on. Lack of standardized approaches and consistent documentation may frustrate efforts by third parties to determine which data scientists’ results, if anybody’s, are most valid when several of them are working on a common problem.
As the number and variety of data scientists at work on a problem grow, they may surface more spurious correlations than causal insights. And if they use different methodologies to produce their various results, it may become more difficult to identify who, if anyone, has delivered the highest-quality insights.
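To make the spurious-correlation risk concrete, here is a minimal Python sketch, not drawn from any particular team's workflow. It generates a purely random target and a batch of purely random candidate variables, then counts how many of them appear correlated with the target by chance alone. The function names, the 30-row sample size, and the 0.36 cutoff (roughly the conventional 5% two-sided significance threshold for 30 samples) are illustrative assumptions.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def count_spurious(n_features, n_samples=30, threshold=0.36, seed=0):
    """Count random features whose |correlation| with a random target
    exceeds the threshold. All data is noise, so every hit is spurious.
    The sample size and threshold are illustrative assumptions."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    hits = 0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_samples)]
        if abs(pearson(feature, target)) > threshold:
            hits += 1
    return hits


if __name__ == "__main__":
    # More candidate variables means more chance "findings".
    for n in (10, 100, 1000):
        print(n, "random variables ->", count_spurious(n), "spurious hits")
```

None of the variables has any real relationship to the target, so every hit is a false lead. The count tends to grow as more candidate variables are tested, which is the same effect that arises when many analysts each try their own sources and feature sets against a common problem.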
If you throw “citizen data scientists” into the mix, the quality problem can easily worsen unless you implement safeguard procedures (to be discussed later in this post). “Citizen data scientist” refers to a member of the new generation of statistical explorers who lack the traditional academic and work backgrounds of established data scientists. Typically, these newbies are self-taught, self-starting, and self-sufficient, and they use self-service cloud-based statistical modeling and data engineering tools. They tend to use idiosyncratic methods, which may frustrate subsequent efforts by others to assess the validity of their findings.