As more companies recognize the need for a data science platform, more vendors claim to have one. Increasingly, we see companies describe their product as a “data science platform” without describing the features that make platforms so valuable. So we wanted to share our vision of the core capabilities a platform needs in order to be valuable to data science teams.
We see “the data science lifecycle” spanning three phases: ideation and exploration, experimentation, and operationalization. Each phase has distinct demands that motivate capabilities for a data science platform.
To some degree, all data science projects go through these phases.
We’ll discuss these four lenses, describing the challenges involved in each and the capabilities a good data science platform should provide.
Quantitative research starts with exploring the data to understand what you have. This might mean plotting data in different ways, examining different features, looking at the values of different variables, etc.
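As a rough illustration of that first look at the data, here is a minimal sketch using only the Python standard library. The `rows` records and the `summarize` helper are hypothetical, invented for this example; in practice most teams would likely reach for a notebook and a data-frame library instead.

```python
import statistics
from collections import Counter

# Hypothetical toy records; a real project would load these from a file or database.
rows = [
    {"age": 34, "plan": "basic", "monthly_spend": 20.0},
    {"age": 45, "plan": "pro", "monthly_spend": 55.5},
    {"age": 29, "plan": "basic", "monthly_spend": 18.0},
    {"age": 52, "plan": "pro", "monthly_spend": None},  # missing value
]


def summarize(rows, column):
    """Return a small summary of one column, ignoring missing values."""
    values = [r[column] for r in rows if r[column] is not None]
    summary = {"count": len(values), "missing": len(rows) - len(values)}
    if all(isinstance(v, (int, float)) for v in values):
        # Numeric variable: report its range and mean.
        summary.update(min=min(values), max=max(values), mean=statistics.mean(values))
    else:
        # Categorical variable: report the most frequent values.
        summary["top_values"] = Counter(values).most_common(3)
    return summary


for column in ("age", "plan", "monthly_spend"):
    print(column, summarize(rows, column))
```

Even a summary this small surfaces the kind of thing exploration is meant to catch, such as a missing value in one variable.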
Ideation and exploration can be time-consuming. The data sets can be large and unwieldy, or you may want to try new packages or tools. If you’re working on a team and have no way of seeing work others have already done, you might end up redoing it. Other people may have already developed insights, created clean data sets, or determined which features are useful and which are not.
Through the process of exploring data, researchers formulate ideas they want to test. At this point, research often shifts from ad hoc work in notebooks to more hardened batch scripts. People run an experiment, review the results, and make changes based on what they’ve learned.
This phase can be slow when experiments are computationally intensive (e.g., model training tasks). This is also where the “science” part of data science can be especially important: tracking variations in your experiments, ensuring past results are reproducible, and getting feedback through a peer review process.
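To make the tracking and reproducibility points concrete, here is a minimal sketch of an experiment log in plain Python. Everything in it is an assumption made for illustration: `run_experiment` is a toy stand-in for a real training task, and `record_run` and `is_reproducible` are hypothetical helpers, not any particular platform's API.

```python
import hashlib
import json
import random


def run_experiment(params, seed):
    """Hypothetical stand-in for a model training task: returns a toy score."""
    rng = random.Random(seed)  # local RNG so the run depends only on params and seed
    noise = rng.uniform(-0.01, 0.01)
    return round(1.0 - abs(params["learning_rate"] - 0.1) + noise, 6)


def record_run(log, params, seed):
    """Run one experiment and append its parameters, seed, and score to the log."""
    score = run_experiment(params, seed)
    # Hash the inputs so identical configurations can be recognized later.
    run_id = hashlib.sha256(
        json.dumps({"params": params, "seed": seed}, sort_keys=True).encode()
    ).hexdigest()[:12]
    entry = {"run_id": run_id, "params": params, "seed": seed, "score": score}
    log.append(entry)
    return entry


def is_reproducible(entry):
    """Re-run a logged experiment and check that the score matches the record."""
    return run_experiment(entry["params"], entry["seed"]) == entry["score"]


log = []
for lr in (0.01, 0.1, 0.5):  # the variations being tracked
    record_run(log, {"learning_rate": lr}, seed=42)

best = max(log, key=lambda e: e["score"])
print(best["params"], is_reproducible(best))
```

The design choice worth noting is that each log entry stores everything needed to repeat the run (parameters and seed), so a reviewer can check a past result by re-running it rather than trusting the recorded number.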
Data science work is only valuable insofar as it creates some impact on business outcomes. That means the work must be operationalized or productionized somehow, i.e., it must be integrated into business processes or decision-making processes.