What does the typical data science project life-cycle look like?
This post looks at practical aspects of implementing data science projects. It assumes a certain level of maturity in big data (more on big data maturity models in the next post) and in data science management within the organization. The life-cycle presented here therefore differs, sometimes significantly, from purist definitions of ‘science’ that emphasize the hypothesis-testing approach. In practice, the typical data science project life-cycle resembles an engineering view, imposed by resource constraints (budget, data and skills availability) and time-to-market considerations.
The CRISP-DM model (CRoss Industry Standard Process for Data Mining) has traditionally defined six steps in the data mining life-cycle. Data science is similar to data mining in several respects, so its life-cycle shares some of these steps.
The CRISP-DM steps are: 1. Business Understanding 2. Data Understanding 3. Data Preparation 4. Modeling 5. Evaluation 6. Deployment
Given a certain level of maturity in big data and data science expertise within the organization, it is reasonable to assume that a library of assets related to data science implementations is available. Key among these are: 1. Library of business use-cases for big data / data science applications 2. Data requirements – business use case mapping matrix 3. Minimum data quality requirements (test cases that verify the data meets the minimum quality needed for a project to be feasible)
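To make the third asset concrete, here is a minimal sketch of what "minimum data quality requirements" expressed as test cases might look like. The function names, column names and thresholds (`check_minimum_quality`, `customer_id`, `churned`, 0.9) are hypothetical and chosen only for illustration; a real library of checks would be tailored to each business use case.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[object]]


def completeness(rows: List[Row], column: str) -> float:
    """Fraction of rows in which `column` is present and not None."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(column) is not None)
    return filled / len(rows)


def check_minimum_quality(
    rows: List[Row],
    min_rows: int,
    min_completeness: Dict[str, float],
) -> List[str]:
    """Return a list of failures; an empty list means the data passes."""
    failures = []
    if len(rows) < min_rows:
        failures.append(f"need at least {min_rows} rows, got {len(rows)}")
    for column, threshold in min_completeness.items():
        ratio = completeness(rows, column)
        if ratio < threshold:
            failures.append(
                f"{column}: completeness {ratio:.2f} below {threshold:.2f}"
            )
    return failures


# Hypothetical sample: one of four rows is missing the target label.
sample = [
    {"customer_id": 1, "churned": False},
    {"customer_id": 2, "churned": None},
    {"customer_id": 3, "churned": True},
    {"customer_id": 4, "churned": True},
]

failures = check_minimum_quality(
    sample,
    min_rows=3,
    min_completeness={"customer_id": 1.0, "churned": 0.9},
)
print(failures)  # ['churned: completeness 0.75 below 0.90']
```

Returning a list of failures rather than a single boolean keeps the result usable as a feasibility report: an empty list means the data clears the minimum bar, and anything else says which requirement was missed.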
In most organizations, data science is a fledgling discipline, so data scientists (except those with an actuarial background) are likely to have limited business domain expertise. They therefore need to be paired with business people and with those who have expertise in understanding the data. This pairing helps data scientists cover steps 1 and 2 of the CRISP-DM model, i.e. business understanding and data understanding, either by gaining that knowledge themselves or by working on those steps jointly.
The typical data science project then becomes an engineering exercise: a defined framework of steps or phases, each with exit criteria. Because the criteria are pre-defined, they allow informed decisions on whether to continue a project, which optimizes resource utilization and maximizes the benefits of the project. This also prevents projects from degrading into money pits through the pursuit of nonviable hypotheses and ideas.
The data science life-cycle thus looks somewhat like: 1. Data acquisition 2. Data preparation 3. Hypothesis and modeling 4. Evaluation and Interpretation 5. Deployment 6. Operations 7. Optimization
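The idea of phases guarded by pre-defined exit criteria can be sketched in a few lines. This is an illustrative assumption about how such gates might be encoded, not a prescribed implementation: the `Phase` class, the `run_gates` function, the metric names and the thresholds are all hypothetical, and only the first three phases are shown for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Metrics = Dict[str, float]


@dataclass
class Phase:
    name: str
    # Pre-defined exit criterion: True means the project may continue.
    exit_criterion: Callable[[Metrics], bool]


def run_gates(phases: List[Phase], metrics: Metrics) -> Optional[str]:
    """Walk the phases in order and return the name of the first phase
    whose exit criterion fails, or None if every gate is passed."""
    for phase in phases:
        if not phase.exit_criterion(metrics):
            return phase.name
    return None


# Hypothetical criteria for the first three phases of the life-cycle.
phases = [
    Phase("Data acquisition", lambda m: m["rows_acquired"] >= 10_000),
    Phase("Data preparation", lambda m: m["completeness"] >= 0.95),
    Phase("Hypothesis and modeling", lambda m: m["validation_auc"] >= 0.70),
]

# Hypothetical measurements collected so far in the project.
metrics = {"rows_acquired": 25_000, "completeness": 0.97, "validation_auc": 0.62}

stopped_at = run_gates(phases, metrics)
print(stopped_at)  # Hypothesis and modeling
```

In a real project the metrics would be gathered phase by phase rather than all at once; the sketch collapses that to show only the decision logic. The value lies in the criteria being agreed before work starts, so the decision to stop a nonviable project is not renegotiated after the money has been spent.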
Data Acquisition – may involve acquiring data from both internal and external sources, including social media or web scraping.